Description
Hello, I'm using Apache Ignite 2.16.0/2.17.0 in a production environment with a 15-node server cluster.
A deadlock occurred when one of the nodes (ip1 below) was executing org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery).
The thread stack is as follows:
"xxx" Id=317 TIMED_WAITING on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@<jdk-version>/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@<jdk-version>/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
at java.base@<jdk-version>/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
at java.base@<jdk-version>/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown Source)
at java.base@<jdk-version>/java.util.concurrent.CountDownLatch.await(Unknown Source)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:8228)
at org.apache.ignite.internal.processors.query.h2.twostep.ReduceQueryRun.tryMapToSources(ReduceQueryRun.java:218)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1065)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:448)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$5.iterator(IgniteH2Indexing.java:1447)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iter(QueryCursorImpl.java:102)
at org.apache.ignite.internal.processors.query.h2.RegisteredQueryCursor.iter(RegisteredQueryCursor.java:91)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:92)
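The hang happens while iterating the cursor of the two-step query: the reduce side waits on a CountDownLatch for replies from the map nodes. As a workaround I am considering an explicit per-query timeout so the query fails fast instead of blocking forever; a minimal sketch, assuming SqlFieldsQuery.setTimeout behaves as documented (cacheName, the SQL text, the process() helper and the 30-second value are placeholders):
// Workaround sketch: bound the wait on map-node replies with a per-query timeout.
SqlFieldsQuery qry = new SqlFieldsQuery("SELECT id FROM AlarmRecord WHERE severity > ?")
    .setArgs(1)
    .setTimeout(30, TimeUnit.SECONDS); // abort instead of waiting indefinitely on the reduce latch
try (FieldsQueryCursor<List<?>> cursor = ignite.cache(cacheName).query(qry)) {
    for (List<?> row : cursor)
        process(row); // placeholder for application logic
}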
Checking the logs, I found that one of the nodes in the cluster had restarted while the query was being executed:
reboot system boot 5.10.0-136.12.0. Mon Mar 4 19:51 - 15:10 (3+19:19)
Checking the latest baseline topology at that time, I found that it contained only the stuck node's own IP (ip1):
globalState=DiscoveryDataClusterState [state=ACTIVE, lastStateChangeTime=xxx, baselineTopology=BaselineTopology [id=0, branchingHash=-708844738, branchingType='New BaselineTopology', baselineNodes=[ip1:port1]]
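For reference, the same information can also be read programmatically via the cluster API rather than from the logs; a small sketch (ignite is the local Ignite instance):
// Sketch: inspect the current baseline topology from code.
Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
if (baseline != null) {
    for (BaselineNode node : baseline)
        System.out.println("Baseline node: " + node.consistentId()); // only ip1 appears here
}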
My Ignite configuration is as follows:
IgniteConfiguration igniteCfg = new IgniteConfiguration();
TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(addressList); // addressList: the addresses of all 15 server nodes
ipFinder.setShared(false);
TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);
DataRegionConfiguration dataRegionConfiguration = new DataRegionConfiguration();
dataRegionConfiguration.setPersistenceEnabled(false);
DataStorageConfiguration dataStorageConfiguration = new DataStorageConfiguration();
dataStorageConfiguration.setDefaultDataRegionConfiguration(dataRegionConfiguration);
igniteCfg.setDiscoverySpi(spi).setDataStorageConfiguration(dataStorageConfiguration);
CacheConfiguration<Integer, AlarmRecord> cacheCfg = new CacheConfiguration<>(cacheName);
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
.setBackups(0)
.setIndexedTypes(Integer.class, AlarmRecord.class)
.setSqlFunctionClasses(ExtIgniteFunctions.class)
.setRebalanceDelay(-1)
.setOnheapCacheEnabled(false)
.setSqlOnheapCacheEnabled(false)
.setQueryParallelism(2)
.setRebalanceMode(CacheRebalanceMode.NONE)
.setAffinity(affFunc);
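In addition, I am wondering whether a cluster-wide default SQL query timeout would at least bound how long such queries can hang; a sketch of the idea, assuming SqlConfiguration.setDefaultQueryTimeout (available since Ignite 2.10) is the right knob and using an arbitrary 30-second value:
// Sketch: a cluster-wide default SQL query timeout so stuck two-step queries eventually fail.
SqlConfiguration sqlCfg = new SqlConfiguration();
sqlCfg.setDefaultQueryTimeout(30_000); // milliseconds; 0 means no timeout (the default)
igniteCfg.setSqlConfiguration(sqlCfg);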
Finally, I would appreciate guidance on:
Recommended production configuration
Any known limitations or best practices to ensure cluster stability and avoid full outages
How should I configure the cluster so that queries already in flight while some nodes restart do not get stuck as described above?
Thank you for your guidance.