I have an index that contains 1 terabyte of events. The index currently has 1 shard and 1 replica.
I want to change the index settings and increase both the number of shards and replicas, but I don't know the optimal size for a single shard. I want the queries to be very fast.
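For reference, this is roughly what I had in mind (the index name and target shard count are just placeholders). As I understand it, the replica count can be changed on a live index, but the shard count cannot, so a split or a reindex would be needed:

PUT /events/_settings
{
  "index": { "number_of_replicas": 2 }
}

# the source index must be write-blocked before it can be split
PUT /events/_settings
{
  "index.blocks.write": true
}

POST /events/_split/events-split
{
  "settings": { "index.number_of_shards": 8 }
}

The 8 shards above are just an example; that is exactly the number I am unsure about.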
The ideal shard size will depend on a number of factors, so there is not necessarily any single shard size that is optimal across the board. Some of these factors are:
Type and complexity of data and mappings
Type of queries used and size of result set returned
Number of concurrent queries
Level of concurrent indexing
Cluster size
Type of hardware used
It would help if you could shed some light on these. It would also be useful if you could tell us which version of Elasticsearch you are using.
The data and mappings can be considered complex—extremely complex, in fact—due to heavy use of nested structures and a very large number of fields (more than 1,000).
The data itself is of mixed types, including text, numbers, and a wide range of special characters.
Query patterns in use (a representative combination is sketched after the list):
exists
wildcard — used heavily; almost every query includes at least one wildcard
term
match
match_all
must
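To make that concrete, a typical query looks roughly like this (the field names here are illustrative, not our real mapping):

GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "event.type": "error" } },
        { "exists": { "field": "user.id" } },
        { "wildcard": { "message": { "value": "*timeout*" } } },
        { "match": { "description": "connection reset" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-7d/d" } } }
      ]
    }
  }
}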
The result set size for some queries (e.g., last 7 days of data) can reach up to 666 million documents.
Query concurrency:
The number of concurrent queries varies; currently up to 19, sometimes more, sometimes fewer.
That is clearly not a standard use case, so I believe you will need to do some testing and benchmarking in order to determine the ideal shard size. I would use the size range Leandro mentioned as a starting point.
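As part of that benchmarking, the search Profile API can show where the time is actually spent. A minimal sketch (index and field names are illustrative):

GET /events/_search
{
  "profile": true,
  "query": {
    "wildcard": { "message": { "value": "*timeout*" } }
  }
}

The profiled response breaks the query down per shard and per Lucene query, which should confirm whether the wildcard clauses dominate the query time.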
It seems that you are using features and query types, e.g. wildcard, that are known to be heavy and slow. Combine this with large result sets and I doubt you will be able to make the queries very fast. Since you have large result sets, make sure you have the fastest storage possible, ideally fast local SSDs.
Storage is likely going to be your first/main bottleneck. HDDs do not sound at all suitable for your use case.
What kind of data is this? Is it time series (not necessarily logs or metrics, but data where time is a primary component and/or a common query filter)? If so, I would break up the index into multiple indices. I would probably do that even if it is not time series, for better manageability...
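A minimal sketch of what I mean, assuming monthly indices (the names and the template pattern are illustrative):

# one composable template covers all the time-based indices
PUT /_index_template/events-template
{
  "index_patterns": ["events-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}

PUT /events-2024-01
PUT /events-2024-02

# searches can target all the indices matching the pattern
GET /events-*/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-7d/d" } }
  }
}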
If not, is there another low-cardinality dimension you could break the data up by, using constant_keyword fields? That could also greatly improve performance...
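For example, if the data could be split by something like data source or tenant, each index could declare that dimension as a constant_keyword, so a filter on it can skip non-matching indices entirely (the names here are purely illustrative):

PUT /events-billing
{
  "mappings": {
    "properties": {
      "dataset": {
        "type": "constant_keyword",
        "value": "billing"
      }
    }
  }
}

# a term filter on the constant_keyword lets non-matching indices be skipped
GET /events-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "dataset": "billing" } }
      ]
    }
  }
}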
Either approach could greatly improve manageability and query performance if done and configured correctly.
If this is a long-term plan, I would start from the ground up...