Restarting cluster takes a long time; update mapping times out

Whenever I restart my cluster completely, do a rolling restart, or a node simply disconnects from the cluster due to a network hiccup, a lengthy recovery process starts that can take multiple hours to complete. All shards on the (restarting / recovering) node become unassigned and, once the node is back online, are reassigned or 'initialized' a few at a time. The cluster state is yellow during this time because enough replicas remain to keep the service up.

However, during this time, creating indices or updating mappings of existing indices often hits a 30s timeout with the following message:

Caused by: listener timeout after waiting for [30000] ms

Our application updates the mappings of all indices on startup, meaning that if we ever need to restart both our ES cluster and our application, we have an outage of multiple hours!
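As a stopgap, the put mapping call itself accepts explicit `timeout` and `master_timeout` parameters, so a request can be allowed to wait longer than the 30s default instead of failing. A hedged sketch (index and field names here are placeholders; if you use the Java high-level REST client, its own listener timeout may need raising as well):

```
PUT /my-index/_mapping/doc?master_timeout=5m&timeout=5m
{
  "properties": {
    "new_field": { "type": "keyword" }
  }
}
```

Note this only stretches the wait: if the master's pending task queue stays saturated for longer than the new timeout, the request still fails.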

We have the following cluster setup in one of our production environments:

  • Elasticsearch 6.2.4
  • 250 indices * 20 shards * 3 replicas = ~15k shards
  • 3 dedicated master eligible nodes
  • 8 data nodes w/ 10 CPU cores, 36GB mem of which 12GB heap
  • Spread over 4 hypervisors with local SSD storage
  • 150M documents @ ~2 TB, including all replicas
  • Index size varies between empty and 100GB

My cluster state size (compressed) is about 1MB, according to _cluster/state/master_node/ : "compressed_size_in_bytes": 846247.

Indices are static and per-customer, and index size varies greatly between customers. Index mappings are fairly complex, with 100-1000 fields per index, including many nested fields. Dynamic index and mapping creation is disabled. Each index has its own unique mapping maintained by our application.

I am not sure where these timeouts are coming from. When monitoring the load, everything seems well within normal limits and no single machine is overloaded. This is with nearly all settings at their default values.

I was able to speed up recovery slightly using indices.recovery.max_bytes_per_sec=200mb (up from 40mb) and cluster.routing.allocation.node_concurrent_recoveries=8 (up from 2), which does bring the entire cluster under very high load. This, however, does not seem to affect the timeouts on mapping updates.
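For reference, the two settings above can be applied at runtime through the cluster settings API (transient here, so they reset on a full cluster restart):

```
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "200mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 8
  }
}
```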

What can I do to make sure that my cluster remains functional in its yellow state? Are there ways to reduce the time it needs to recover? Why does it take so long to 'initialize' shards on a node that already has them but just disconnected or restarted for a few moments? Why does the cluster time out instead of processing my mapping update? What is actually keeping the cluster so busy during this time, given that the load on the nodes doesn't show anything shocking?

I would say that you are severely oversharded and have far too many shards in your cluster. Please see this blog post for more details and guidelines. 20 shards seems excessive even for an index with 100GB of data, and is outright wasteful for smaller indices.

I would recommend that you adjust the number of shards per index so that the average shard size is at least around 10GB. This means that very small indices should have a single primary shard.
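To put the figures from the original post in perspective, a quick back-of-the-envelope calculation (numbers taken from the cluster description above):

```python
# Numbers from the cluster described in the original post.
indices = 250
shards_per_index = 20
copies = 3                  # as counted in the post (~15k shards total)
total_bytes = 2 * 1000**4   # ~2 TB, including all replicas

total_shards = indices * shards_per_index * copies
avg_shard_gb = total_bytes / total_shards / 1000**3
print(total_shards)              # 15000
print(f"{avg_shard_gb:.3f} GB")  # 0.133 GB per shard on average
```

An average shard of roughly 133MB is about two orders of magnitude below the ~10GB guideline, which is why cluster state management and recovery bookkeeping dominate over actual data movement.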

If you reindex and create the indices correctly, you can later use the split index API to increase the number of shards for customers whose shards grow beyond 10-20GB in size.
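A sketch of what that split looks like (index names hypothetical; the source index must be made read-only first, and on 6.x it must have been created with `index.number_of_routing_shards` set):

```
PUT /customer-big/_settings
{ "index.blocks.write": true }

POST /customer-big/_split/customer-big-split
{
  "settings": { "index.number_of_shards": 4 }
}
```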

This should reduce the size of the cluster state and help speed up recovery.


Thanks @Christian_Dahlqvist ! There are already plans to greatly reduce the number of shards for existing indices. Since we cannot predict each customer's growth we wanted to be flexible, but it looks like there is a limit to this flexibility :slight_smile:

However, I don't feel this solves the problem. Sure, having 14K shards on ~300 indices might be overkill, but imagine for a second that we grow quite large and end up with 14K shards on 5000 indices. I feel the problem will remain: while the cluster is recovering or restarting, it will not accept create index / update mapping requests, and they will time out.

I found this pull request with an ongoing discussion about prioritizing different cluster tasks, such as the mapping updates that cause similar issues. From the looks of it, this made it into 6.3 or later. Since I am still running 6.2.4, could this be relevant?

The _split_ API does look promising, but until 7.0 it still relies on number_of_routing_shards, which is something we didn't account for during index creation. Until we are on 7.0 (and recreate all indices there) this will not help us, sadly.
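For indices we create from here on, the knob can at least be set at creation time (names and values illustrative):

```
PUT /customer-42
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_routing_shards": 20
  }
}
```

This allows a later split to any factor of 20 (e.g. 2, 4, 5, 10, or 20 shards) without reindexing.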

Having individual indices per customer typically does not scale well, as it adds a lot of overhead and increases the size of the cluster state. You can get around this by instead running multiple smaller clusters, but the best approach is typically to have smaller customers share indices.
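One common pattern for sharing an index across small customers is a single shared index with the customer id as the routing value, so each customer's documents land on, and their queries hit, a single shard. A hypothetical sketch (index, type, and field names are examples only):

```
PUT /customers-shared/doc/1?routing=customer-42
{ "customer_id": "customer-42", "field": "value" }

GET /customers-shared/_search?routing=customer-42
{
  "query": { "term": { "customer_id": "customer-42" } }
}
```

Filtering on the customer id in addition to routing is still needed, because routing only narrows the search to a shard, not to one customer's documents.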

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.