Hello!
We are having issues creating mappings on our index: we are seeing a sudden burst of ProcessClusterEventTimeoutExceptions on PUT mapping requests.
Stack from es-master02:
[es-master02] failed to put mappings on indices [[index_name]], type [type_name]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [type_name]) within 30s
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:263)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Error from client (Ruby gem elasticsearch-1.0.14):
{"error":"RemoteTransportException[[es-master02][inet[/<IP>:9300]][indices:admin/mapping/put]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [type_name]) within 30s]; ","status":503}
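For context, the failing calls look roughly like this (a sketch with a hypothetical field name, not our actual mapping; the client is the elasticsearch Ruby gem, v1.0.14):
# Sketch of the kind of PUT mapping call that gets the 503.
# The field name below is hypothetical.
require 'elasticsearch'

client = Elasticsearch::Client.new(host: 'es-master02:9200')

client.indices.put_mapping(
  index: 'index_name',
  type:  'type_name',
  body:  {
    'type_name' => {
      'properties' => {
        'example_field' => { 'type' => 'string', 'index' => 'not_analyzed' }
      }
    }
  }
)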
Here's our cluster config:
3x master nodes each with 1 core / 4 GB RAM / 1 Gbps uplinks
10x data nodes each with 12 core / 64 GB RAM (28 GB on JVM HEAP and 36 GB reserved for FS cache) / 2.4 TB disk over 3x 800 GB SSD / 100 Mbps uplinks
For the relevant index: 20 shards, ~41K mappings, 925M documents, 11 TB indexed
Elasticsearch 1.4.4, running on Docker.
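If it helps with diagnosis, the pending cluster tasks API on the elected master can show whether put-mapping updates are queuing behind other cluster-state changes; here is a minimal sketch of how we'd poll it (host/port are illustrative, the endpoint is the standard /_cluster/pending_tasks):
# Poll the master's pending cluster tasks and count the put-mapping ones.
# Host and port are illustrative.
require 'net/http'
require 'json'
require 'uri'

uri = URI('http://es-master02:9200/_cluster/pending_tasks')

loop do
  tasks = JSON.parse(Net::HTTP.get(uri))['tasks'] || []
  mapping_tasks = tasks.select { |t| t['source'].to_s.start_with?('put-mapping') }
  puts "#{Time.now} pending: #{tasks.size} total, #{mapping_tasks.size} put-mapping"
  sleep 5
end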
- We have tried increasing the port speed on the master nodes, which gave a slight improvement in ping times.
- We have tried reducing the number of parallel indexing requests. For a while that reduced the number of 503s, but it no longer does: almost all PUT mapping requests now get 503s. (Roughly how we cap indexing concurrency is sketched below this list.)
- We have recently added more data nodes (because we were running out of disk space).
- We haven't tried increasing the port speed on the data nodes (it would require a full rolling cluster restart, which would take a long time, and this is just a hunch).
- We haven't tried moving off Docker (again a hunch; we don't know whether it would help).
- We haven't tried beefing up the master nodes (although we aren't seeing CPU or heap pressure on them).
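As mentioned above, this is roughly how we cap indexing concurrency (a sketch only; MAX_WORKERS and the index/type names are placeholders, and our real pipeline is more involved):
# A fixed pool of worker threads drains a bounded queue, so at most
# MAX_WORKERS bulk requests are in flight against the cluster at once.
require 'thread'
require 'elasticsearch'

MAX_WORKERS = 4  # placeholder value

client = Elasticsearch::Client.new(host: 'es-master02:9200')
queue  = SizedQueue.new(MAX_WORKERS * 2)

workers = MAX_WORKERS.times.map do
  Thread.new do
    while (batch = queue.pop)   # a nil entry is the shutdown signal
      client.bulk(index: 'index_name', type: 'type_name', body: batch)
    end
  end
end

# Producer side pushes arrays of bulk actions, then signals shutdown:
#   queue.push([{ index: { data: { 'example_field' => 'value' } } }])
#   MAX_WORKERS.times { queue.push(nil) }
#   workers.each(&:join)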
We are wondering what the likely cause could be and what you would recommend to mitigate it.
Thanks for your time.