Better cluster configuration for 63 terabytes of data

Hey gang, our team has an Elasticsearch cluster filled with HTML documents. Here is some data about our current configuration:

  • It consists of 10 data nodes, each with a heap.size of 32GB
  • It contains a total of 600 shards
  • Each shard is 1000GB in size

Because of this, we are looking into a better configuration. Here is the new plan:

  • Have 20 shards per gigabyte of heap per data node. With 32GB of heap per data node, that's 6,400 possible shards across the 10-node cluster.
  • To keep each shard at 30GB or less, we would need 2,100 shards (the arithmetic is spelled out in the sketch below).

Since an index can't have more than 1,000 shards inside of it, we are trying to decide what the best course of action would be. We are considering having two indexes with an alias connecting them. Does the math look good to everyone, or am I missing a vital piece here? Does anyone have any suggestions on how we can improve our configuration here?
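
To make the arithmetic easy to check, here is a rough sketch of the numbers we are working from (the 63TB figure is our own estimate of the raw HTML, not something measured off the cluster):

```python
# Rough sanity check of the shard math above.
# Assumptions: ~63TB of data, 10 data nodes, 32GB heap per node,
# the "20 shards per GB of heap" guideline, and a 30GB shard target.
total_data_gb = 63_000
data_nodes = 10
heap_gb_per_node = 32
shards_per_gb_heap = 20
target_shard_size_gb = 30

max_shards = data_nodes * heap_gb_per_node * shards_per_gb_heap
shards_needed = total_data_gb / target_shard_size_gb

print(f"upper bound from the heap guideline: {max_shards}")        # 6400
print(f"shards needed at 30GB per shard: {shards_needed:.0f}")     # 2100
```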

This is way beyond the official recommendation on shard sizing. In general, the ideal shard size depends on the data and the queries. Larger shards can mean slower queries, but if a shard size of 50GB is able to meet your query SLAs, that is often a good size to aim for.

If I calculate correctly (600 shards × 1,000GB per shard), this comes to 600TB and not 63TB. Which is correct?

I hope the heap size is a bit less than 32GB so you benefit from the use of compressed object pointers. You should be able to see this logged on startup.
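
If you'd rather not dig through the startup logs, something along these lines should also show it per node via the nodes info API (the endpoint is a placeholder, and I'm assuming the `using_compressed_ordinary_object_pointers` field in the JVM section of the response, which is what the versions I've checked report):

```python
import requests

# Ask every node whether its JVM is using compressed ordinary object pointers.
# "localhost:9200" is a placeholder for your cluster's endpoint; add auth if needed.
resp = requests.get(
    "http://localhost:9200/_nodes/jvm",
    params={"filter_path": "nodes.*.jvm.using_compressed_ordinary_object_pointers"},
)
resp.raise_for_status()

for node_id, info in resp.json()["nodes"].items():
    print(node_id, info["jvm"]["using_compressed_ordinary_object_pointers"])
```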

This is an old guideline that is not necessarily applicable in the most recent versions of Elasticsearch. In older versions it was never a recommended level but rather a maximum, as a lot of users ended up in trouble by oversharding their clusters.

Splitting it up into multiple indices might be a good idea, and you can use an alias to query the data. Note that you will need to index/update/delete documents directly through the index name, though, and cannot do this through the alias.
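
For illustration, a minimal sketch of what that could look like (index and alias names are made up, and the endpoint is a placeholder):

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Point one alias at two backing indices so searches can span both.
actions = {
    "actions": [
        {"add": {"index": "html-docs-000001", "alias": "html-docs"}},
        {"add": {"index": "html-docs-000002", "alias": "html-docs"}},
    ]
}
requests.post(f"{ES}/_aliases", json=actions).raise_for_status()

# Searches can go through the alias...
requests.get(f"{ES}/html-docs/_search", json={"query": {"match_all": {}}})

# ...but writes still have to name a concrete index, as noted above.
requests.post(f"{ES}/html-docs-000001/_doc", json={"content": "<html>...</html>"})
```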

Can you describe the use case in more detail? What are the querying and index/update patterns?

Are all queries run against all data? Is your current cluster meeting your performance requirements? If not, what issues are you having with the current setup?

Which version of Elasticsearch are you using?

I'm not sure this is the main concern with shards larger than 50GiB. A tiny number of enormous shards will limit the concurrency of individual searches (in older ES versions), which could have an impact on individual search latency (but not overall throughput); with 60 shards per node this is likely not a huge concern. Recent ES versions parallelize searches by segment rather than by shard, so it may not matter at all any more.

Oversized shards are a bit more of a pain to manage (e.g. they take correspondingly longer to recover after a node failure) but it's not a huge deal IMO.

As Christian says, recent ES versions can cope with many more shards than this.

This is the key question: what problems are you trying to solve here? Do you really need to change anything?


Hey crew,

Thanks for the fast replies! Our current configuration of 1000GB per shard is not meeting our query requirements. Since we are querying against HTML documents, we query a single field and check whether it contains the substring we are passing in. This works fine when we query one keyword against it; however, when we query 10 different substrings against it, it leads to huge CPU utilization spikes.
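
To make that concrete, here is a simplified sketch of the kind of query I mean, written as leading-wildcard clauses in a bool/should (the field name, index name, and substrings below are placeholders, not our real ones):

```python
import requests

# Simplified version of one of our substring searches: up to ~10 substrings,
# each matched anywhere inside a single field via a leading wildcard.
substrings = ["invoice", "refund", "cancellation"]  # placeholders

query = {
    "query": {
        "bool": {
            "should": [
                {"wildcard": {"body_content": {"value": f"*{s}*"}}} for s in substrings
            ],
            "minimum_should_match": 1,
        }
    }
}

resp = requests.get("http://localhost:9200/html-docs/_search", json=query)
print(resp.json()["hits"]["total"])
```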

Could you please provide the full output of the cluster stats API?

What does a typical query look like? What does a sample document look like? What are the mappings of the index?

Can you explain why this is undesirable? If you ask ES to do some work, it seems like you would want it to use all the resources it can to complete the work as fast as possible.