Elasticsearch cluster on Kubernetes is highly unstable

We have a three-node Elasticsearch cluster running on a Kubernetes 1.6.4 cluster. We have spun up a dedicated AWS r4.large only for the 3 Elasticsearch containers.
We are experiencing many issues with this setup.
We randomly get the errors below; the cluster nodes then restart and the ES cluster goes into an unusable state.

[2017-06-16T13:19:46,176][ERROR][o.e.a.b.TransportBulkAction] [elasticsearch-logging-0] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@22924ec0 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@adc82c9[Running, pool size = 2, active threads = 2, queued tasks = 265, completed tasks = 1450]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_131]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:89) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:74) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:508) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:136) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:85) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:149) ~[?:?]
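
For anyone skimming the trace, the useful numbers are in the rejection message itself. Below is a minimal sketch (plain Python, standard library only) that pulls them out of the executor part of the message copied from the log above. The helper name `parse_rejection` and the field list are just for illustration and are not part of any Elasticsearch tooling.

```python
import re

# Executor part of the rejection message, copied from the log above.
LOG_LINE = (
    "EsThreadPoolExecutor[bulk, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@adc82c9"
    "[Running, pool size = 2, active threads = 2, queued tasks = 265, "
    "completed tasks = 1450]]"
)

# Counters that appear in the message as "<name> = <number>".
FIELDS = (
    "queue capacity",
    "pool size",
    "active threads",
    "queued tasks",
    "completed tasks",
)


def parse_rejection(line):
    """Return the thread pool name and its numeric counters (hypothetical helper)."""
    name = re.search(r"EsThreadPoolExecutor\[(\w+),", line)
    stats = {"pool": name.group(1) if name else None}
    for field in FIELDS:
        match = re.search(re.escape(field) + r" = (\d+)", line)
        if match:
            stats[field] = int(match.group(1))
    return stats


stats = parse_rejection(LOG_LINE)
print(stats)
```

Read this way, the message says the bulk thread pool had both of its 2 threads busy and 265 tasks queued against a configured queue capacity of 200 when the request was rejected.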

Can anyone help us fix this?

PS.
During these errors, CPU usage is high on each node, and as a result the nodes get restarted. After that the cluster never stabilizes (some data shard check, I guess) and the nodes keep restarting. We are using fluentd to push Kubernetes container logs to Elasticsearch. We have been facing this issue for a long time. Once the cluster is up and running, it runs indefinitely without any issues until a node gets restarted.
