503 PUT mapping exceptions with a large number of mappings

Hello!

We are having issues creating mappings on our index: we are seeing sudden bursts of ProcessClusterEventTimeoutExceptions on PUT mapping requests.

Stack from es-master02:
[es-master02] failed to put mappings on indices [[index_name]], type [type_name]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [type_name]) within 30s
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:263)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Error from client (Ruby gem elasticsearch-1.0.14):
{"error":"RemoteTransportException[[es-master02][inet[/<IP>:9300]][indices:admin/mapping/put]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [type_name]) within 30s]; ","status":503}

Here's our cluster config:
3x master nodes, each with 1 core / 4 GB RAM / 1 Gbps uplinks
10x data nodes, each with 12 cores / 64 GB RAM (28 GB JVM heap, 36 GB reserved for FS cache) / 2.4 TB disk over 3x 800 GB SSDs / 100 Mbps uplinks
For the relevant index: 20 shards, ~41K mappings, 925M documents, 11 TB indexed
Elasticsearch 1.4.4 running on Docker.

  • We have tried increasing port speeds on the master nodes, which slightly improved ping times.

  • We have tried reducing the number of parallel indexing requests, which reduced the number of 503s for a while, but that is no longer the case (almost all PUT mapping requests now get 503s).

  • We have also added more data nodes recently (because we were running out of disk space).

  • We haven't tried increasing port speeds on the data nodes (it would require a full rolling cluster restart, which would take a lot of time, and this is just a hunch).

  • We haven't tried moving off Docker (again a hunch; we don't know whether it would help).

  • We haven't tried beefing up the master nodes (although we aren't seeing CPU or heap pressure on them).

I guess we are wondering what the likely cause could be and what solutions are recommended to mitigate it.

Thanks for your time.

Use the master_timeout parameter to increase the time the requesting node is allowed to wait for the master to respond.
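
With the Ruby client you should be able to pass it directly on the put mapping call, roughly like this (a sketch; I'm assuming the gem forwards master_timeout as a URL parameter, e.g. PUT /index_name/_mapping/type_name?master_timeout=60s, and the names below are placeholders):

  require 'elasticsearch'

  client = Elasticsearch::Client.new(hosts: ['esnode01:9200'])

  # Allow the master up to 60s to process the put-mapping cluster event.
  client.indices.put_mapping(
    index: 'index_name',
    type:  'type_name',
    master_timeout: '60s',
    body: {
      'type_name' => {
        'properties' => {
          'some_new_field' => { 'type' => 'string' }
        }
      }
    }
  )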

Thank you, Jörg!

We tried the master_timeout option, and while it mitigated the errors for a short time, unfortunately it didn't seem to fix the problem.
First we set master_timeout to 60s. For a short time the cluster reaches a steady state and creates mappings successfully, but then a burst of ProcessClusterEventTimeoutExceptions occurs. The cluster recovers within a few minutes, and then another burst of exceptions happens.
We increased master_timeout to 90s to see if that would help, but saw the same behaviour.

We also tried putting the mappings on an entirely new index (on the same cluster), but this had no effect on the number of 503s.

I guess the underlying question is what is taking put mapping so long in the first place that it would need such a long timeout. At that point the master should have already created the mapping and should only be waiting for an ack for the put mapping.
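
In case it helps with diagnosis, we are watching the master's cluster update queue while the bursts happen via the pending tasks API; a rough sketch with the same Ruby client (cluster.pending_tasks corresponds to GET /_cluster/pending_tasks; the host is a placeholder):

  require 'elasticsearch'

  client = Elasticsearch::Client.new(hosts: ['esnode01:9200'])

  # Each entry is a queued cluster state update (put-mapping tasks included),
  # along with how long it has been sitting in the master's queue.
  client.cluster.pending_tasks['tasks'].each do |task|
    puts "#{task['priority']} #{task['source']} queued for #{task['time_in_queue']}"
  end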

Thank you!