I have a single ES node (ES 2.1.1) on RHEL 6.5 with Oracle jdk1.7.9_79, 8 cores and 64 GB of RAM. The heap size is 1 GB, the index has 0 replicas, and I am indexing about 10 documents per second (one per thread).
I'm occasionally getting these messages in the logs:
EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@1919c887 on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@51c35f6d[Running, pool size = 8, active threads = 8, queued tasks = 200, completed tasks = 41484992]]]
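To make the counters in that message easier to read, here is a minimal Python sketch that pulls them out of the text. It is not part of our system: the function name `parse_rejection`, the dictionary keys and the `saturated` flag are made up for illustration, and the regular expression only assumes the message layout shown above.

```python
import re

# The rejection message quoted above, copied verbatim.
MESSAGE = (
    "EsRejectedExecutionException[rejected execution of "
    "org.elasticsearch.action.support.replication."
    "TransportReplicationAction$PrimaryPhase$1@1919c887 on "
    "EsThreadPoolExecutor[index, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@51c35f6d"
    "[Running, pool size = 8, active threads = 8, queued tasks = 200, "
    "completed tasks = 41484992]]]"
)

# Assumed layout: pool name and queue capacity first, then the executor counters.
PATTERN = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<queue_capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active_threads>\d+),"
    r" queued tasks = (?P<queued_tasks>\d+), completed tasks = (?P<completed_tasks>\d+)"
)


def parse_rejection(message):
    """Return the thread pool counters from a rejection message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # Every captured group except the pool name is a number.
    stats = {
        key: (value if key == "pool" else int(value))
        for key, value in match.groupdict().items()
    }
    # Hypothetical convenience flag: all threads busy and the queue full.
    stats["saturated"] = (
        stats["active_threads"] == stats["pool_size"]
        and stats["queued_tasks"] >= stats["queue_capacity"]
    )
    return stats


print(parse_rejection(MESSAGE))
```

For the message above, this shows that all 8 threads of the index pool were busy and all 200 queue slots were taken when the request was rejected.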
This is one line from my access log (the interesting values are the two timestamps and the 920 near the end):
1,10.7.245.200,2016-07-13 17:00:42.286,A,1,A,N,E,S,7,A,,1,A,S,4,S,1700,0000,2359,3600000,N,192.168.251.56,gzip,3781,1091,637,920,ok,2016-07-13 17:00:55.627
So this request came into our system at 17:00:42, it took our proxy 920 ms to get the response from the API servers, and the response was returned to the origin at 17:00:55. After getting the response from the server, the document is indexed in Elasticsearch and the ending timestamp is set. That means this document took about 12 seconds to be indexed in ES. Since this situation lasts for several seconds, I assume the queue fills up and the rejection is triggered.
There are no GC issues at that moment (in fact, minor collections are triggered only every 10 minutes because of the low activity, and no Full GC or CMS cycle occurred at that time). CPU is 97% idle and iowait is almost 0%. There are no messages in syslog.
Could anybody give me a clue about what's going on? At least a starting point to investigate... The ES logs do not show any incidents and our application is working fine (it has several components, and ES is just one of them; the rest are OK).
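The arithmetic behind the "about 12 seconds" figure can be sketched in Python as follows. The field positions are an assumption read off the sample line (third field = arrival timestamp, third-from-last = milliseconds spent waiting for the API servers, last = ending timestamp), and the function name `indexing_delay_seconds` is hypothetical.

```python
from datetime import datetime, timedelta

# The access log line quoted above, copied verbatim.
LINE = (
    "1,10.7.245.200,2016-07-13 17:00:42.286,A,1,A,N,E,S,7,A,,1,A,S,4,S,"
    "1700,0000,2359,3600000,N,192.168.251.56,gzip,3781,1091,637,920,ok,"
    "2016-07-13 17:00:55.627"
)

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def indexing_delay_seconds(line):
    """Time between receiving the backend response and the ending timestamp."""
    fields = line.split(",")
    # Assumed positions, inferred from the sample line only.
    arrived = datetime.strptime(fields[2], TIMESTAMP_FORMAT)
    finished = datetime.strptime(fields[-1], TIMESTAMP_FORMAT)
    backend_time = timedelta(milliseconds=int(fields[-3]))
    # Whatever is left after the backend call is attributed to indexing.
    return (finished - arrived - backend_time).total_seconds()


print(indexing_delay_seconds(LINE))
```

For the sample line, the total is 13.341 s between the two timestamps; subtracting the 920 ms backend call leaves roughly 12.4 s, which is the time I am attributing to the indexing call.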