Context
We have a Java application with embedded Elasticsearch 6.0.0. We use Elasticsearch's BulkProcessor to load very large numbers of documents into our (single) index.
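For context, our loader's hot path looks roughly like this (a minimal sketch; the class and method names mirror the stack trace below, but the body is illustrative, not our real code):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

class BulkDocumentLoader {
    private final BulkProcessor bulkProcessor; // built once at startup (setup further below)

    BulkDocumentLoader(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor;
    }

    void sendDocumentToElasticSearch(String id, String json) {
        // add() only queues the document; the BulkProcessor decides when to flush
        bulkProcessor.add(new IndexRequest("myindex", "doc", id)
                .source(json, XContentType.JSON));
    }
}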
The Problem
So far so good, except when we have to load a massive database such as https://www.medline.com/. In those cases the load process sometimes hangs after a day or two of document loading, and we get no feedback out of it at all.
Checking thread dumps of the hung process, we see it is always parked at the same point in the flow, as if the Semaphore is never released (a simplified sketch of that flow follows the dump):
"main" #1 prio=5 os_prio=0 tid=0x00007f50d400d800 nid=0x67e waiting on condition [0x00007f50db9e5000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000739c2e2f0> (a java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:63)
at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:323)
at org.elasticsearch.action.bulk.BulkProcessor.executeIfNeeded(BulkProcessor.java:314)
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:271)
- locked <0x00000007335849b0> (a org.elasticsearch.action.bulk.BulkProcessor)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:254)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:250)
at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:236)
at com.myapp.base.loader.BulkDocumentLoader.sendDocumentToElasticSearch(BulkDocumentLoader.java:26)
at com.myapp.base.loader.ESLoader.bulkLoad(ESLoader.java:416)
at com.myapp.base.run.CommandRunner.load(CommandRunner.java:292)
at com.myapp.base.run.MainLoader.main(MainLoader.java:97)
Locked ownable synchronizers:
- None
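To make the hang mechanism concrete, here is a simplified paraphrase of what BulkRequestHandler.execute does in ES 6.0.0 (our own sketch, not the literal source): a Semaphore with one permit per concurrent request gates the in-flight bulks, and the permit is only handed back from the asynchronous response/failure callbacks.

import java.util.concurrent.Semaphore;

// Our simplified sketch of the BulkRequestHandler flow -- illustrative only.
class BulkFlowSketch {
    private final Semaphore semaphore;

    BulkFlowSketch(int concurrentRequests) {
        this.semaphore = new Semaphore(concurrentRequests);
    }

    void execute(Runnable sendBulkAsync) throws InterruptedException {
        semaphore.acquire();          // <- where "main" is parked in our dump
        try {
            // the real handler hands the bulk to the client asynchronously;
            // release() is supposed to happen in the response/failure callbacks
            sendBulkAsync.run();
        } catch (RuntimeException e) {
            semaphore.release();      // synchronous failures return the permit
            throw e;
        }
    }

    // invoked from the bulk listener, on success OR failure
    void onBulkFinished() {
        semaphore.release();          // if this never runs, the permit leaks
    }
}

So if even one callback is lost, or a release is skipped on some path (e.g. during retries), a permit leaks; once all permits have leaked, every subsequent add() blocks in acquire() forever, which matches what we observe.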
Environment:
- Embedded ES 6.0 (I know we have to move forward, but that's not possible right now)
- Java application running either on Debian or in Docker (Java 8u232). Each loader runs individually as a child process (two loader processes are never run at the same time), launched with:
/usr/local/openjdk-8/bin/java -Xmx3g -Xms3g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -cp /opt/lib/*:. -jar /opt/MyApp.jar
Cluster:
{
  "cluster_name" : "mycluster_ds",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
Bulk Processor setup:
- Bulk Size = 20 MB
- Bulk Actions = -1 (disabled)
- Bulk Concurrent Requests = number of available CPUs (tested on servers with both 8 and 20)
- Flush Interval = -1 (disabled)
- Backoff Policy = 100 ms initial delay, maximum of 3 retries
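In code, the processor is built roughly like this (a minimal sketch with the listener bodies omitted; we assume exponential backoff here, BackoffPolicy.constantBackoff would be the alternative):

import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

static BulkProcessor buildProcessor(Client client) {
    return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                // per-item failures are reported here
            }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {
                // whole-request failures are reported here
            }
        })
        .setBulkActions(-1)                                  // no action-count trigger
        .setBulkSize(new ByteSizeValue(20, ByteSizeUnit.MB)) // flush every ~20 MB
        .setConcurrentRequests(Runtime.getRuntime().availableProcessors())
        // Flush Interval = -1: we simply never call setFlushInterval()
        .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
        .build();
}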
So I'd like to know whether anyone has ever experienced something like this, or has any idea of what I might be missing. Thanks!