Java process hangs during BulkRequestProcessor add - Waiting for Semaphore

Context

We have a Java application with embedded Elasticsearch 6.0.0. We use the ES BulkProcessor to load large numbers of documents into our (single) index.

The Problem

This works fine, except when we have to load a massive database like https://www.medline.com/. In those cases our load process sometimes hangs after a day or two of document loading, and we don't get any feedback out of it.

When we check the thread dumps of the process, the main thread is always at this point of the flow (it looks like the Semaphore never gets released, so the process hangs):

"main" #1 prio=5 os_prio=0 tid=0x00007f50d400d800 nid=0x67e waiting on condition [0x00007f50db9e5000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000739c2e2f0> (a java.util.concurrent.Semaphore$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
	at org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:63)
	at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:323)
	at org.elasticsearch.action.bulk.BulkProcessor.executeIfNeeded(BulkProcessor.java:314)
	at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:271)
	- locked <0x00000007335849b0> (a org.elasticsearch.action.bulk.BulkProcessor)
	at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:254)
	at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:250)
	at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:236)
	at com.myapp.base.loader.BulkDocumentLoader.sendDocumentToElasticSearch(BulkDocumentLoader.java:26)
	at com.myapp.base.loader.ESLoader.bulkLoad(ESLoader.java:416)
	at com.myapp.base.run.CommandRunner.load(CommandRunner.java:292)
	at com.myapp.base.run.MainLoader.main(MainLoader.java:97)

   Locked ownable synchronizers:
	- None
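
For anyone not familiar with the pattern in the trace above: the bulk handler appears to guard its in-flight requests with a Semaphore, so if a permit is never handed back, the next caller parks indefinitely. Below is a minimal, stdlib-only sketch of that situation. It is not the Elasticsearch code; `SemaphoreHangSketch`, `BoundedSender`, `trySend` and `complete` are hypothetical names made up for illustration, and it uses a timed `tryAcquire` instead of the blocking `acquire()` seen in the dump so that the demo terminates.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class SemaphoreHangSketch {

    // Hypothetical stand-in for a handler that limits concurrent requests.
    public static final class BoundedSender {
        private final Semaphore permits;

        public BoundedSender(int concurrentRequests) {
            this.permits = new Semaphore(concurrentRequests);
        }

        // Tries to start a request. Returns false if no permit became
        // available within the timeout (a blocking acquire() would park here).
        public boolean trySend(long timeoutMillis) {
            try {
                return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }

        // Must be called when a request finishes, on success AND on failure.
        // If a code path skips this, the permit is lost for good.
        public void complete() {
            permits.release();
        }
    }

    public static void main(String[] args) {
        BoundedSender sender = new BoundedSender(2);

        System.out.println(sender.trySend(100)); // true: first permit taken
        System.out.println(sender.trySend(100)); // true: second permit taken

        // Neither request has completed, so no permit is left.
        System.out.println(sender.trySend(100)); // false: would hang with acquire()

        sender.complete();                       // one request finally finishes
        System.out.println(sender.trySend(100)); // true: permit is available again
    }
}
```

With a plain `acquire()` in place of the timed call, the third `trySend` would never return, which is the same state the "main" thread is in above.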

Environment:

  • Embedded ES 6.0 (I know we have to move forward, but it's not possible for right now)
  • Java application running either on Debian or Docker (java 8u232). Each loader process is a child thread run individually (two loader processes are never run at the same time) with:
/usr/local/openjdk-8/bin/java -Xmx3g -Xms3g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 /opt/lib/*:. -jar /opt/MyApp.jar

Cluster:

{
  "cluster_name" : "mycluster_ds",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Bulk Processor setup:

Bulk Size = 20mb
Bulk Actions = -1
Bulk Concurrent Threads = number of available CPUs (tested on servers with 8 and with 20)
Flush Interval = -1
Backoff Policy = 100ms, max 3 retries

So I'd like to know if anyone has ever experienced something like this or has any idea of what I might be missing. Thanks!

It seems that you are aware that running Elasticsearch in embedded mode stopped being supported with the release of Elasticsearch 5.0. I would therefore expect very few people here on the forum (if any) to have a similar setup as you do, which may make it hard to get any help. I am not able to help on this, but would strongly recommend moving away from embedded mode and onto a more standard and supported architecture.

Hey, @Christian_Dahlqvist, thanks for your reply. Yeah, you're totally right and we are aiming for that. I just needed a workaround for this very moment to keep what we already have running well for a little while.

Anyway, I was able to solve this issue by reducing the number of concurrent requests for the bulk processor instead of letting it use all available CPUs.

I haven't looked into this in detail but there was a deadlock bug fixed in the BulkProcessor by the following PR; perhaps you are hitting this bug since you're using version 6.0.0 which is very old.

That's interesting. According to the ticket, the fix was backported to 6.8, and that is an upgrade I can manage. Thanks for sharing this!