Timeout exceptions with many time-based indices after 00:00


I am putting various kinds of logs into Elasticsearch with daily time-based indices (somedata-YYYY.MM.DD).
Recently, I started putting many other kinds of logs into ES, and ES then started to log many ProcessClusterEventTimeoutException errors after 00:00 AM.

[2016-10-19 00:00:30,858][DEBUG][action.admin.indices.mapping.put] [myhostname] failed to put mappings on indices [[somedata-2016.10.18]], type [fluentd]

ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [fluentd]) within 30s]
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Actually, I have more than 200 indices every day, so when the date changes, almost all of them need a new index for the next day.

I suspected that some node (maybe the master node) was under high CPU load, but as shown below, CPU usage is under 5% on all nodes (servers), and CPU usage actually drops during this time period (00:00-00:10 AM).

It seems that ES is not under high load, but something is blocking the operation. Can anyone suggest how I can investigate this?

By the way, I am running a 10-node Elasticsearch cluster, version 2.4.0.


Elasticsearch's cluster state management is single-threaded for simplicity, so I wouldn't expect to see the load average spike.
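One way to see that single queue backing up is the pending cluster tasks API, which lists queued cluster-state updates (such as put-mapping and create-index) along with their priority and time in queue. A minimal sketch, assuming a node reachable on the default local port:

```shell
# List cluster-state updates waiting in the (single-threaded) queue.
# Around 00:00 you would expect to see many create-index / put-mapping
# tasks stacked up here. Hostname and port are assumptions.
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"
```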

Are you using dynamic mappings? Those can cause lots of extra cluster state changes as new properties are dynamically added. It is usually much quicker to set up the mapping beforehand, either by creating the index with the mapping you want before it is needed, or by setting up index templates. Creating the indexes before they are needed is a fairly nice approach because you can stagger them, or just set the timeout to some very high number.
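A minimal sketch of pre-creating the next day's index, assuming GNU date; the field names in the mapping are just placeholders, and only the `somedata-*` name and `fluentd` type come from the question. A fixed base date is used here for illustration; a cron job run before midnight would use `+1 day` relative to now:

```shell
# Derive the next day's index name (fixed base date for illustration).
NEXT_INDEX="somedata-$(date -d '2016-10-18 +1 day' +%Y.%m.%d)"
echo "$NEXT_INDEX"   # somedata-2016.10.19

# Pre-create the index with an explicit mapping (ES 2.x syntax), so no
# dynamic put-mapping has to go through the cluster-state queue at 00:00.
# Commented out here since it needs a live cluster; field names are examples.
# curl -XPUT "http://localhost:9200/${NEXT_INDEX}" -d '{
#   "mappings": { "fluentd": { "properties": {
#     "@timestamp": { "type": "date" },
#     "message":    { "type": "string" }
#   }}}
# }'
```

Staggering when this runs per index family (rather than letting all 200 indices race at midnight) is what spreads the cluster-state updates out.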

200 daily indices sounds like a lot. What is the rationale behind having so many? How many shards does that result in on a daily basis? What is the average shard size? How long do you keep your data in the cluster?

Thanks for the suggestion.

Ah, I didn't know that cluster state management is single-threaded. That explains our situation.
Yes, I am using dynamic mappings. I think I should try creating the necessary indices beforehand.

The reason we have 200 daily indices is that we run a log collection and analytics platform, which collects various kinds of logs from many applications.

The number of shards is 10 per index, resulting in 2,000 shards per day. The average shard size is 127MB. We keep the data for only 2 days (because we use Hadoop HDFS for long-term storage).

Those amount to very small shards. It'd probably make sense to combine a few of them.
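One low-effort way to shrink the shard count is an index template that sets a smaller `number_of_shards` for all the daily indices and fixes the mapping at the same time. A sketch using the ES 2.x `_template` API; the template name, shard count, and field names are assumptions to adjust:

```shell
# Hypothetical template: create every somedata-* daily index with 2 primary
# shards instead of 10, and a fixed mapping for the "fluentd" type so no
# dynamic put-mapping is needed (ES 2.x _template API; config fragment only).
curl -XPUT "http://localhost:9200/_template/somedata_daily" -d '{
  "template": "somedata-*",
  "settings": { "number_of_shards": 2 },
  "mappings": {
    "fluentd": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "string" }
      }
    }
  }
}'
```

At ~127MB average per shard, dropping from 10 shards to 2 still leaves shards well under common size recommendations, while cutting the daily shard count from 2,000 to 400.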