I am running an Elasticsearch cluster with 3 nodes (all 3 are master-eligible, 2 of them are data nodes).
The version is 6.4. Each ES instance runs with an 8GB heap. I see the following exceptions in ES and Logstash.
Exceptions from Elasticsearch:
[2018-11-03T01:48:52,924][ERROR][o.e.a.b.TransportBulkAction] [linux02] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$1@1e6e916f on EsThreadPoolExecutor[name = linux02/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@42b4f351[Running, pool size = 4, active threads = 4, queued tasks = 255, completed tasks = 10127143]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.4.0.jar:6.4.0]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_51]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_51]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:59) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:495) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:134) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:85) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:128) [x-pack-security-6.4.0.jar:6.4.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:87) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:76) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:407) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:484) [elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk$$Lambda$3040/1268768997.accept(Unknown Source) [x-pack-monitoring-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:74) [x-pack-core-6.4.0.jar:6.4.0]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:108) [x-pack-monitoring-6.4.0.jar:6.4.0]
Exceptions from Logstash:
[2018-11-03T01:39:52,814][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@5f2c5137 on EsThreadPoolExecutor[name = linux02/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@42b4f351[Running, pool size = 4, active threads = 4, queued tasks = 241, completed tasks = 9729978]]"})
I can see that the Logstash errors are caused by the ES issues. The errors I see in ES have been discussed in other threads, but I could not find a solution that fixes the problem. As suggested in those threads, I did not change the thread pool or queue size.
I see ES performing constant GC during this time.
Could you please let us know the best strategy to mitigate this issue?
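For reference, the write thread pool that is rejecting these bulk requests (queue capacity = 200 in the log above) can be watched with the cat thread pool API; a minimal sketch:

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed

A steadily growing rejected count on the node holding the active shards is the same condition Logstash reports back as HTTP 429.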
Thanks for your reply. I will look into these articles. Here is the other information that you requested:
How many shards do you have in your cluster?
900 shards
How many of these are you actively indexing into?
Correct me if I am wrong about the active shards below.
Only two indices are actively written to in the ES cluster, since the index name changes with each new date. Each index has 5 active shards, and I suspect the polldata index is the problem due to its size.
requestdata-2018.11.02 2 r STARTED 181 168kb linux02
requestdata-2018.11.02 3 r STARTED 194 240.1kb linux02
requestdata-2018.11.02 1 r STARTED 167 133.3kb linux02
requestdata-2018.11.02 4 r STARTED 183 144.7kb linux02
requestdata-2018.11.02 0 r STARTED 186 185.1kb linux02
polldata-2018.11.02 2 r STARTED 17325657 4.3gb linux02
polldata-2018.11.02 1 r STARTED 17301462 4.4gb linux02
polldata-2018.11.02 3 r STARTED 17309500 4.3gb linux02
polldata-2018.11.02 4 r STARTED 17319192 4.4gb linux02
polldata-2018.11.02 0 r STARTED 17323642 4.4gb linux02
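(For reference, a listing like the one above can be produced with the cat shards API, for example GET _cat/shards/polldata-*,requestdata-*?v.)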
How many clients do you have indexing into the cluster?
Only one, which is Logstash. A single Logstash instance writes data into the ES cluster for the two indexes.
It all started when we restarted Logstash to scan the files from the beginning of the day, so bulk data started being published to ES.
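For context, the Logstash side presumably looks something like the following elasticsearch output for one of the two daily indices (a hypothetical sketch; the actual pipeline configuration was not shared in the thread, and the host is just the node name from the logs above):

output {
  elasticsearch {
    hosts => ["http://linux02:9200"]
    # daily index name, hence a new index per date
    index => "polldata-%{+YYYY.MM.dd}"
  }
}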
If it started when you initiated a scan of files from the beginning of the day, it is possible that you are writing to additional time-based indices as they are created based on UTC time zone. If you have a lot of worker threads on your Logstash instance(s), there could be a lot of parallel bulk requests each targeting 20 or more shards. As it looks like all your primary shards are on one node, the queue could fill up quite quickly here, resulting in bulk rejections.
If your average shard size tends to be smaller than 20-40GB, you should be able to reduce the number of primary shards for your indices through index templates. This will reduce the probability of seeing bulk rejections going forward.
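A minimal sketch of such a template for 6.x, assuming the daily naming shown above (the pattern and shard count are examples to adjust to your own sizing):

PUT _template/polldata
{
  "index_patterns": ["polldata-*"],
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}

This only affects newly created daily indices; existing indices keep their current shard count.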
I do not see the issue now; it seems there is less load over the weekend and things have stabilized. I do not have multiple Logstash instances at the moment, just one instance using the default number of workers.
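If the rejections come back under load, the bulk pressure from that single instance can also be reduced by lowering the pipeline settings in logstash.yml (illustrative values only, not something prescribed in this thread):

pipeline.workers: 2
pipeline.batch.size: 125

pipeline.workers defaults to the number of CPU cores, and pipeline.batch.size (125 by default) is the number of events each worker sends per bulk request.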
I see the following stats for the node that has the issue, especially around the heap. The heap size is 8GB, usage is around 6GB, and I see constant GC allocation failures in the GC logs.
I am planning to shrink each index from 5 shards to 1, since the maximum I can see is about 4GB per shard.
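For an existing daily index, the 6.x shrink API can do this; a rough sketch (the index and node names are taken from the output above, and the target index name is just an example), noting that the source index must first be made read-only with all of its shard copies relocated to a single node:

PUT polldata-2018.11.02/_settings
{
  "index.routing.allocation.require._name": "linux02",
  "index.blocks.write": true
}

POST polldata-2018.11.02/_shrink/polldata-2018.11.02-1shard
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}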
I am allocating 30GB as the max heap. But I have a question about how to scale up by adding new nodes and making sure a new node takes on the new data.
example:
Node 1 with 30GB
Node 2 with 30GB
and Node 3 with the master role only and no data.
What if Node 1 is filled with primaries and Node 2 with replicas? When adding a Node 4, how do we ensure the new node takes the incoming writes?
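For what it's worth, once a new data node joins, Elasticsearch by default rebalances existing shards and spreads the shards of newly created daily indices across all data nodes; where shards actually land can be checked with the cat APIs, for example:

GET _cat/allocation?v
GET _cat/shards/polldata-*?v&h=index,shard,prirep,state,node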
The cluster only receives inserts and search queries (very minimal, with no heavy aggregations), and no updates to any existing documents. We just ship all logs to this cluster and retrieve them when required.
We want to store the logs forever.
GC Log:
2018-11-05T08:44:52.348-0500: 226702.660: Total time for which application threads were stopped: 0.0208133 seconds, Stopping threads took: 0.0000797 seconds
2018-11-05T08:44:53.125-0500: 226703.437: [GC (Allocation Failure) 226703.437: [ParNew
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
age 1: 2359888 bytes, 2359888 total
age 2: 1244416 bytes, 3604304 total
age 3: 668712 bytes, 4273016 total
age 4: 657688 bytes, 4930704 total
: 286485K->7061K(306688K), 0.0251225 secs] 3605324K->3325900K(8354560K), 0.0252621 secs] [Times: user=0.07 sys=0.01, real=0.03 secs]
2018-11-05T08:44:53.151-0500: 226703.462: Total time for which application threads were stopped: 0.0260867 seconds, Stopping threads took: 0.0000628 seconds
2018-11-05T08:44:54.219-0500: 226704.531: [GC (Allocation Failure) 226704.531: [ParNew
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
age 1: 8641816 bytes, 8641816 total
age 2: 888304 bytes, 9530120 total
age 3: 322840 bytes, 9852960 total
age 4: 665144 bytes, 10518104 total
age 5: 654752 bytes, 11172856 total
: 279701K->13859K(306688K), 0.0233829 secs] 3598540K->3332698K(8354560K), 0.0235458 secs] [Times: user=0.07 sys=0.01, real=0.03 secs]
2018-11-05T08:44:54.243-0500: 226704.554: Total time for which application threads were stopped: 0.0243791 seconds, Stopping threads took: 0.0000688 seconds
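For context, the ParNew "(Allocation Failure)" entries above are ordinary young-generation collections; this output format comes from the detailed JDK 8 GC logging flags, roughly:

-Xloggc:<path> (writes the GC log to a file)
-XX:+PrintGCDateStamps (the ISO timestamps at the start of each line)
-XX:+PrintGCDetails (the ParNew and heap-occupancy sections)
-XX:+PrintTenuringDistribution (the "age N: ... bytes" lines)
-XX:+PrintGCApplicationStoppedTime (the "Total time for which application threads were stopped" lines)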