EsRejectedExecutionException indexing 10 small (2k) docs per second

Hello,

I have 1 ES node (ES 2.1.1) on RHEL 6.5, Oracle jdk1.7.9_79, 8 Cores and 64 GB of RAM. Heap size = 1G, Index with 0 Replicas, indexing about 10 documents per second (one per thread).

I’m getting these messages occassionally in logs:

EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@1919c887 on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@51c35f6d[Running, pool size = 8, active threads = 8, queued tasks = 200, completed tasks = 41484992]]]

This is one of the lines of my Access log (the interesting values in bold)
1,10.7.245.200,2016-07-13 17:00:42.286,A,1,A,N,E,S,7,A,,1,A,S,4,S,1700,0000,2359,3600000,N,192.168.251.56,gzip,3781,1091,637,920,ok,2016-07-13 17:00:55.627

So this request came into our system at 17:00:42, it took to our proxy 920ms to get the response from api servers and this response was returned to the origin at 17:00:55. After getting the response from the server the document is indexed in elastic search and the ending timestamp is set. That means that this document took about 12 seconds to be indexed in ES. Since this situation happens for several seconds, I understand that the queue becomes full and the message is triggered.

There is no GC issues at that moment (in fact the minor generations are triggered every 10 minutes due to the low activity, and no Full GC or CMS has occurred at that moment). CPU is 97% idle and an iowait almost 0%. No messages in syslogs.

Could anybody give me any clue about what’s going on? At least a starting point to investigate ... ES logs do not show any incidence and our application is working fine (it has several components, and ES is just one of them, the rest are ok).

Thanks,

Joan.

You can certainly crank up so logs. See these instructions on how to change the log levels on the fly. I might set index to debug and see if that gets you anything. You could also look at graphing stats like delete and merge. Marvel does that if you want to use it but whatever tool you want to use to capture the statistics and graph them is fine by me.

This confusing exception just means you are sending too many concurrent indexing requests (> 200 in your case) to this node.

It's better to use the bulk indexing API instead instead of the index (one document) API, indexing multiple documents in one request.

You can also increase the thread pool queue sizes on your node, but that's dangerous since you can tie up a lot of JVM heap holding onto those requests.

Really ES ought to have a streaming bulk indexing API so that things like how many client threads to use, size of bulk request, queue sizes, rejected execution exceptions, JVM heap tied up, all just go away.

Sorry for the delay, I depend on the client who is on holiday intermitently ...

After performing the following actions:

curl -XPUT 'http://192.168.253.19:9200/_cluster/settings' -d '{ "transient" : { "logger.replication" : "DEBUG" } }'
curl -XPUT 'http://192.168.253.19:9200/1*/_settings' -d '{ "index.indexing.slowlog.threshold.index.info": "1s" }'

Nothing appears in any log. Are these sentences ok? The first one applies debug log to the replication module because the exception comes from there. The second one applies to the indexing slow log for all indexes with name starting with '1' (my indexes are 1_name1, 1_name2, etc).

I also attach the output for curl 'http://192.168.1.70:9200/_stats/indexing,flush,merge,refresh' in this link:
http://www.grupoventus.com/download/elastic.txt

Any help would be very appreciated.