Process Cluster Event Timeout Exception on put-mapping

Hi All,

Occasionally I get the following exception on my 6-node Elasticsearch 2.4.6 cluster:

[2018-04-08 03:04:09,903][DEBUG][action.admin.indices.mapping.put] [PROD Node1] failed to put mappings on indices [[index_XXXXX]], type [TYPE_yyyy]
ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [TYPE_yyyy]) within 30s]
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:361)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I have increased the timeout using the following code:

		// Raise the ack timeout and the master-node timeout to 5 minutes each.
		client.admin().indices().preparePutMapping(index).setType(mapping)
				.setSource(m.get(schemaMapping))
				.setTimeout("300s").setMasterNodeTimeout("300s")
				.execute().actionGet();

Even with this, I am still getting 30-second timeouts. Why would it time out even though I increased the timeout to 5 minutes?

Any help is appreciated

Thanks a bunch

IIRC, many improvements were made over the past few years in 5.x and 6.x.

I'd suggest upgrading.

Each update to the cluster state is single threaded in order to ensure consistency, so a large cluster state or very frequent updates to it can cause delays.
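To make the queueing effect concrete, here is a rough model in plain Java. It is not Elasticsearch code: the class and method names are hypothetical, and it assumes a strict first-in-first-out queue and that a task fails once it has waited longer than the timeout before the single worker picks it up.

```java
import java.util.Arrays;

// Illustrative model only; names and FIFO ordering are assumptions.
final class ClusterEventQueueModel {

    /**
     * Simulates one worker that processes tasks strictly in order.
     * Returns, per task, whether it waited in the queue longer than timeoutMs.
     * Tasks must be listed in submission order.
     */
    static boolean[] timedOut(long[] submitMs, long[] durationMs, long timeoutMs) {
        boolean[] result = new boolean[submitMs.length];
        long workerFreeAt = 0;
        for (int i = 0; i < submitMs.length; i++) {
            long start = Math.max(workerFreeAt, submitMs[i]);
            long waited = start - submitMs[i];
            if (waited > timeoutMs) {
                result[i] = true;          // dropped before it ever ran
            } else {
                workerFreeAt = start + durationMs[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] submit = {0, 0, 0};
        long[] duration = {20_000, 20_000, 1_000};
        // Two slow updates ahead of a quick one: the quick one waits 40s.
        System.out.println(Arrays.toString(timedOut(submit, duration, 30_000)));
    }
}
```

In this model the third task is cheap on its own but still fails, because the two slow updates ahead of it keep the single worker busy for 40 seconds.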

You should be able to run the following command to get the size of your cluster state: curl -XGET 'localhost:9200/_cluster/state/master_node/*?pretty'

How many indices and shards do you have in the cluster? Are you using dynamic mappings? How many fields do you have per index? Do you have any indices where you have a very large and potentially ever growing list of types?

Here is the information

  1. We have around 5000 indices and 10,000 shards
  2. We are not using dynamic mappings
  3. We do have some indices with around 60 types and up to 500 fields (60 and 500 are among the largest ones)

I thought that since 2.x, cluster state propagation is incremental and the master communicates only deltas. Also, why is the 5-minute timeout not honored?

Thank you

That is a lot of indices and mappings, which could mean a large cluster state. Cluster state propagation is incremental in Elasticsearch 2.x, but if it is very large processing may still be slowed down. Can you check the size of your cluster state using the cluster state API?
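One way to check the size on a version where the API does not report it is to count the bytes of the JSON response. The sketch below uses only the JDK; the class name is hypothetical, and the default URL assumes a node listening on localhost:9200 without authentication. The number it prints is the uncompressed JSON size, which is not the same as the compressed size the master publishes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; counts response bytes without holding them in memory.
final class ClusterStateSize {

    /** Reads the stream to the end and returns the number of bytes seen. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** Issues a GET request and returns the size of the response body. */
    static long fetchSize(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return countBytes(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumed default endpoint; pass another URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:9200/_cluster/state";
        try {
            System.out.println(fetchSize(url) + " bytes (uncompressed JSON)");
        } catch (IOException e) {
            System.err.println("Could not read cluster state: " + e.getMessage());
        }
    }
}
```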

Christian,

It is around 500MB if I dump the output of the cluster state API into a text file.

Thanks
Ashok

What is the compressed size as reported by the API I linked to?

The API doesn't return this information. I am not sure if it is available on the 2.4 version we are on.

Thanks
Ashok

That is quite large and may very well cause the slowness. How come you have so many indices with just one primary and one replica shard?

It is 2 shards per index with 1 replica. Given our architecture, there is no easy way for us to bring down the number of indices. Is it advisable to consider multiple clusters as opposed to adding more nodes to our cluster, given that our number of indices will keep growing? Currently, we have 6 m4.4xlarge machines (16 vCPU, 64GB RAM) in our cluster.

In my previous response, I mentioned that we have 5000 indices; that is incorrect. The number of indices is 2500 and the total number of shards is 10,000.
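The corrected figures can be checked with a little arithmetic. The sketch below (class and method names are made up for illustration) multiplies indices by primaries and by copies per primary, then spreads the result over the 6 nodes mentioned in this thread.

```java
// Hypothetical helper for sanity-checking shard counts.
final class ShardMath {

    /** Total shards = indices * primaries per index * (1 primary copy + replicas). */
    static long totalShards(long indices, int primariesPerIndex, int replicas) {
        return indices * primariesPerIndex * (1L + replicas);
    }

    /** Shards per node, rounded up, assuming an even spread. */
    static long shardsPerNode(long totalShards, int nodes) {
        return (totalShards + nodes - 1) / nodes;
    }

    public static void main(String[] args) {
        long total = totalShards(2500, 2, 1);
        System.out.println("total shards: " + total);                        // 10000
        System.out.println("shards per node: " + shardsPerNode(total, 6));   // 1667
    }
}
```

With 2500 indices, 2 primaries, and 1 replica the total comes to 10,000 shards, or roughly 1,667 per node on 6 nodes.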

Hi @ashokm
From personal experience, it seems like you have a very high ratio of indices and shards per node. As each shard takes resources, I'd try to amend the data structure so it fits into fewer indices (and maybe use filtered aliases to keep queries intact).
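As a sketch of the filtered-alias idea, the snippet below builds the JSON body for a POST to the `_aliases` endpoint that adds an alias with a term filter. The index name, alias name, and field are invented for illustration, and the helper does no JSON escaping, so it assumes plain alphanumeric inputs.

```java
// Hypothetical helper; builds a request body for POST /_aliases.
final class FilteredAliasBody {

    /**
     * Returns an "add alias" action whose filter restricts the alias to
     * documents where the given field equals the given value.
     * Inputs are assumed to need no JSON escaping.
     */
    static String addFilteredAlias(String index, String alias, String field, String value) {
        return "{\"actions\":[{\"add\":{"
                + "\"index\":\"" + index + "\","
                + "\"alias\":\"" + alias + "\","
                + "\"filter\":{\"term\":{\"" + field + "\":\"" + value + "\"}}"
                + "}}]}";
    }

    public static void main(String[] args) {
        // Example values only: one shared index, one alias per former index.
        System.out.println(addFilteredAlias("shared_index", "customer_42", "customer_id", "42"));
    }
}
```

Queries that used to target a dedicated index could then target the alias instead, while the data lives in one shared index.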

Aside from that, I can say that while Elasticsearch v6.2.3 does improve many index and cluster operations, numbers such as yours may still lead to update_mapping timeouts (as shown in https://github.com/elastic/elasticsearch/issues/30370).

Thank you so much, Lior. We will revisit our architecture to reduce the number of indices.
