GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery) encountered a thread deadlock issue when querying data. #12623

@gswcomputing

Hello guys, I’m using Apache Ignite 2.16.0/2.17.0 in a production environment with a cluster of 15 server nodes.

A deadlock occurred while one of the nodes (referred to below as ip1) was executing org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery).

Thread stack is as follows:

"xxx" Id=317 TIMED_WAITING on java.util.concurrent.CountDownLatch$Sync@9342695
    at java.base@<version>/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.CountDownLatch$Sync@9342695
    at java.base@<version>/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
    at java.base@<version>/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
    at java.base@<version>/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown Source)
    at java.base@<version>/java.util.concurrent.CountDownLatch.await(Unknown Source)
    at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:8228)
    at org.apache.ignite.internal.processors.query.h2.twostep.ReduceQueryRun.tryMapToSources(ReduceQueryRun.java:218)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1065)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:448)
    at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$5.iterator(IgniteH2Indexing.java:1447)
    at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iter(QueryCursorImpl.java:102)
    at org.apache.ignite.internal.processors.query.h2.RegisteredQueryCursor.iter(RegisteredQueryCursor.java:91)
    at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:92)
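
To make the thread state in this dump easier to interpret, here is a minimal JDK-only sketch of a thread that polls a CountDownLatch with a timed await. It is not Ignite code: the class name LatchWaitSketch, the helper names and the 500 ms poll interval are assumptions made for illustration. It only shows that a thread waiting this way reports TIMED_WAITING until countDown() is called, which is what the frames above look like when the expected reply never arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchWaitSketch {
    /** Starts a daemon thread that polls the latch with a timed await (hypothetical helper). */
    static Thread startWaiter(CountDownLatch latch) {
        Thread t = new Thread(() -> {
            try {
                // await(...) returns false on timeout, so this loop only ends
                // once another thread calls latch.countDown().
                while (!latch.await(500, TimeUnit.MILLISECONDS)) {
                    // A liveness check or an overall deadline could go here (assumption).
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "latch-waiter");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Polls until the thread reports the expected state or the timeout elapses. */
    static boolean reachesState(Thread t, Thread.State expected, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (t.getState() == expected) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        CountDownLatch latch = new CountDownLatch(1);
        Thread waiter = startWaiter(latch);
        System.out.println("timed waiting: " + reachesState(waiter, Thread.State.TIMED_WAITING, 2000));
        latch.countDown(); // the missing "reply" arrives
        System.out.println("terminated: " + reachesState(waiter, Thread.State.TERMINATED, 2000));
    }
}
```

The comment inside the loop marks where a waiter could check whether the peer it is waiting for is still alive. Whether Ignite performs such a check at this point is not something this sketch demonstrates.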

The logs show that one of the nodes in the cluster restarted while the query was executing:
reboot system boot 5.10.0-136.12.0. Mon Mar 4 19:51 - 15:10 (3+19:19)

At that point, the latest baseline topology listed only one node, ip1, which is the node where the thread was stuck:

globalState=DiscoveryDataClusterState [state=ACTIVE, lastStateChangeTime=xxx, baselineTopology=BaselineTopology [id=0, branchingHash=-708844738, branchingType='New BaselineTopology', baselineNodes=[ip1:port1]]

My Ignite configuration is as follows:

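// addressList holds the addresses of the 15 server nodes
IgniteConfiguration igniteCfg = new IgniteConfiguration();
TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(addressList).setShared(false);
TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);
// In-memory only: persistence is disabled for the data region
DataRegionConfiguration dataRegionConfiguration = new DataRegionConfiguration();
dataRegionConfiguration.setPersistenceEnabled(false);
// setDataStorageConfiguration expects a DataStorageConfiguration, so the region
// is wrapped in one here (assumed to be the default data region)
DataStorageConfiguration dataStorageConfiguration = new DataStorageConfiguration();
dataStorageConfiguration.setDefaultDataRegionConfiguration(dataRegionConfiguration);
igniteCfg.setDiscoverySpi(spi).setDataStorageConfiguration(dataStorageConfiguration);
CacheConfiguration<Integer, AlarmRecord> cacheCfg = new CacheConfiguration<>(cacheName);
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
    .setBackups(0)
    .setIndexedTypes(Integer.class, AlarmRecord.class)
    .setSqlFunctionClasses(ExtIgniteFunctions.class)
    .setRebalanceDelay(-1)
    .setOnheapCacheEnabled(false)
    .setSqlOnheapCacheEnabled(false)
    .setQueryParallelism(2)
    .setRebalanceMode(CacheRebalanceMode.NONE)
    .setAffinity(affFunc);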

Finally, I would appreciate guidance on:
- The recommended production configuration.
- Any known limitations or best practices to ensure cluster stability and avoid full outages.
- How to configure the cluster so that queries already in flight when some nodes restart do not get stuck as described above.

Thank you for your guidance.
