What is the optimal shard size?

Hello,

I have an index that contains 1 terabyte of events. The index currently has 1 shard and 1 replica.

I want to change the index settings and increase both the number of shards and replicas, but I don't know the optimal size for a single shard. I want the queries to be very fast.
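
What I had in mind is roughly the following (index names and numbers are just placeholders). As far as I understand, replicas can be changed on the live index, but the primary shard count is fixed at creation time, so more primaries would mean splitting or reindexing into a new index:

```
# replicas are a dynamic setting on the existing index
PUT /events/_settings
{
  "index": { "number_of_replicas": 2 }
}

# more primaries require a new index (then _split or _reindex into it)
PUT /events-v2
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 2
  }
}
```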

What is the optimal shard size?

Between 10 GB and 50 GB.

Check this documentation.
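
As a starting point, you can see how big your shards currently are with the cat shards API (the index name is just an example):

```
GET _cat/shards/events?v&h=index,shard,prirep,docs,store&s=store:desc
```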

The ideal shard size will depend on a number of factors, so there is not necessarily any single shard size that is optimal across the board. Some of these factors are:

  • Type and complexity of data and mappings
  • Type of queries used and size of result set returned
  • Number of concurrent queries
  • Level of concurrent indexing
  • Cluster size
  • Type of hardware used

It would help if you could shed some light on these. It would also be useful if you could tell us which version of Elasticsearch you are using.

Alright, I’ll answer as accurately as possible.

The data and mappings can be considered complex—extremely complex, in fact—due to heavy use of nested structures and a very large number of fields (more than 1,000).
The data itself is of mixed types, including text, numbers, and a wide range of special characters.
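
For context, we are over the default field limit (index.mapping.total_fields.limit defaults to 1,000), so it has been raised along these lines (index name and value are only illustrative):

```
PUT /events/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```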

Query patterns in use (a representative query is sketched after the list):

  • exists
  • wildcard — used heavily; almost every query includes at least one wildcard
  • term
  • match
  • match_all
  • must
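
As a representative example (field names and values are made up; must here refers to the bool clause wrapping the others):

```
GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        { "wildcard": { "message.keyword": { "value": "*timeout*" } } },
        { "match": { "description": "connection reset" } },
        { "term": { "status": "error" } }
      ],
      "filter": [
        { "exists": { "field": "user.id" } }
      ]
    }
  }
}
```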

The result set size for some queries (e.g., last 7 days of data) can reach up to 666 million documents.
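
Pulling anything close to that is only practical with point-in-time plus search_after paging, since plain from/size is capped at 10,000 hits by default. A minimal sketch of the pattern, with a made-up index and sort field:

```
POST /events/_pit?keep_alive=5m

GET /_search
{
  "size": 10000,
  "track_total_hits": false,
  "query": { "range": { "@timestamp": { "gte": "now-7d/d" } } },
  "pit": { "id": "<id returned by the _pit call>", "keep_alive": "5m" },
  "sort": [ { "@timestamp": "asc" }, { "_shard_doc": "asc" } ]
}
```

Each subsequent page repeats the search with a search_after array containing the sort values of the last hit from the previous page.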

Query concurrency:

  • The number of concurrent queries varies; currently up to 19, sometimes more, sometimes fewer.

Indexing activity:

  • 3 active indexing processes

Cluster details:

  • Total data size: 11 TB
  • Number of nodes: 5

Hardware specs per node:

  • 64 GB RAM (one node has 101 GB)
  • 32 CPU cores
  • HDD storage on all nodes

Elasticsearch version:

  • 8.17.0

That is clearly not a standard use case, so I believe you will need to do some testing and benchmarking to determine the ideal shard size. I would use the size range Leandro mentioned as a starting point.
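
One fairly low-effort way to do that is to reindex a sample of the data into a few test indices with different shard counts and run your heaviest queries against each (names and numbers below are placeholders):

```
PUT /shardtest-4
{
  "settings": { "number_of_shards": 4, "number_of_replicas": 0 }
}

POST /_reindex?wait_for_completion=false
{
  "max_docs": 10000000,
  "source": { "index": "events" },
  "dest": { "index": "shardtest-4" }
}
```

Repeat for, say, 8 and 16 shards and compare the latencies of your typical queries.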

It seems that you are using features and query types, e.g. wildcard, that are known to be heavy and slow. Combined with large result sets, I doubt you will be able to get very fast queries. Given the large result sets, make sure you have the fastest storage possible, ideally fast local SSDs.

Storage is likely going to be your first/main bottleneck; HDDs do not sound at all suitable for your use case.
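
One thing that may also be worth testing for the heavy wildcard usage is mapping the affected fields as the wildcard field type, which is designed for patterns (including leading wildcards) that are expensive on regular keyword fields. Field and index names below are made up:

```
PUT /events-wildcard-test
{
  "mappings": {
    "properties": {
      "message": { "type": "wildcard" }
    }
  }
}
```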

This means that if I have 1 terabyte of data in the index, I will need 21 shards.
And I want to have 1 replica for each shard.

Would this be too much, or is it reasonable?

What I’m thinking is: if I have 100 indices and apply the same shard count to each,
would that lead to oversharding, or is it considered normal?
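
To spell out the math I am worried about: 21 primaries plus 1 replica each is 42 shard copies per index, so 100 such indices would mean around 4,200 shards spread over 5 nodes, i.e. roughly 840 shards per node.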

Ouch 🙂

What kind of data is this? Is it a time series (which does not necessarily mean logs or metrics, but data where time is a primary component and/or a common query filter)? If so, I would break the index up into multiple indices. I would probably do that even if it is not time series, for better manageability...
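
With time-based indices you could, for example, write to monthly indices such as events-2025-01 and let an index template apply the shard settings; names and values below are only illustrative:

```
PUT _index_template/events-monthly
{
  "index_patterns": ["events-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 2,
      "index.number_of_replicas": 1
    }
  }
}
```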

If not, is there another low-cardinality dimension you could break the data up by, combined with constant_keyword fields? That could also greatly improve performance...
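
As a sketch of the constant_keyword idea: each per-dimension index hard-codes the value in its mapping, so a query filtering on that field can skip all the other indices entirely (field, value and index name are made up):

```
PUT /events-eu
{
  "mappings": {
    "properties": {
      "region": { "type": "constant_keyword", "value": "eu" }
    }
  }
}
```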

These approaches could greatly improve manageability and query performance if done and configured correctly.

If this is a long-term plan, I would start from the ground up...

Understood, thank you.

I have another question:

What is the optimal size for a single node?

For example, with five nodes that each have 15 TB of storage, I’m considering limiting the data on each node to 10 TB or less.

Is it better to have large storage per node, or smaller storage per node?
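
For reference, the way I was thinking of enforcing that cap is through the disk allocation watermarks, using absolute free-space values (with 15 TB disks, requiring roughly 5 TB free works out to about 10 TB used; the numbers are placeholders, not a recommendation):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "5tb",
    "cluster.routing.allocation.disk.watermark.high": "4tb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "3tb"
  }
}
```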