We have a three-node Elasticsearch cluster running on a Kubernetes 1.6.4 cluster. We have spun up a dedicated AWS r4.large instance just for the 3 Elasticsearch containers.
We are experiencing many issues with this setup.
At random times we get the errors below, the cluster nodes restart, and the Elasticsearch cluster then goes into an unusable state.
[2017-06-16T13:19:46,176][ERROR][o.e.a.b.TransportBulkAction] [elasticsearch-logging-0] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@22924ec0 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@adc82c9[Running, pool size = 2, active threads = 2, queued tasks = 265, completed tasks = 1450]]
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.4.1.jar:5.4.1]
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_131]
    at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:89) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:74) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:508) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:136) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:85) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:149) ~[?:?]
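To make the numbers in the error above easier to read, here is a minimal Python sketch that pulls the thread pool figures out of the rejection message. It assumes the exact message wording shown in our 5.4.1 log; the function name parse_pool_status and the "saturated" flag are our own illustrative choices, not anything Elasticsearch provides.

```python
import re

# Pattern for the executor status embedded in the rejection message.
# Assumption: the wording matches the Elasticsearch 5.4.1 log line above.
POOL_RE = re.compile(
    r"EsThreadPoolExecutor\[(?P<name>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)\]"
)


def parse_pool_status(message):
    """Return the thread pool figures from a rejection message, or None.

    Hypothetical helper for illustration only.
    """
    match = POOL_RE.search(message)
    if match is None:
        return None
    # Keep the pool name as a string, convert every counter to int.
    status = {
        key: (value if key == "name" else int(value))
        for key, value in match.groupdict().items()
    }
    # Our own definition: all workers busy and the queue at or over capacity.
    status["saturated"] = (
        status["active"] >= status["pool_size"]
        and status["queued"] >= status["capacity"]
    )
    return status


# The relevant part of the log line from this post.
sample = (
    "rejected execution of "
    "org.elasticsearch.ingest.PipelineExecutionService$2@22924ec0 "
    "on EsThreadPoolExecutor[bulk, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@adc82c9"
    "[Running, pool size = 2, active threads = 2, "
    "queued tasks = 265, completed tasks = 1450]]"
)

print(parse_pool_status(sample))
```

For our log line this reports the bulk pool with 2 threads, both active, and 265 queued tasks against a queue capacity of 200, which is why the bulk request was rejected.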
Can anyone help us fix this?
PS.
During these errors, CPU usage is high on every node, and as a result the nodes get restarted. After that the cluster never stabilizes (some data shard check, I guess) and the nodes keep restarting. We use fluentd to push Kubernetes container logs to Elasticsearch. We have been facing this issue for a long time. Once the cluster is up and running, it runs without any issues until a node gets restarted.