VietDuc
(BuiVietDuc)
July 19, 2023, 3:29am
2
Today we know that it is actually a feature to make sure the performance better
opened 12:11PM - 31 May 22 UTC
closed 01:21PM - 08 Mar 23 UTC
>enhancement
:Data Management/ILM+SLM
Team:Data Management
### Description
Elasticsearch ships with some built-in ILM policies, e.g. the… re is one called `30-days-default` that looks like this:
```json
{
"phases": {
"hot": {
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "30d"
}
}
},
"warm": {
"min_age": "2d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"delete": {
"min_age": "30d",
"actions":{
"delete": {}
}
}
},
"_meta": {
"description": "built-in ILM policy using the hot and warm phases with a retention of 30 days",
"managed": true
}
}
```
I would suggest adding a `max_primary_shard_docs` next to the `max_primary_shard_size`. The actual value needs more discussions, but I'm thinking of something in the order of 200M in order to get a better search experience with space-efficient datasets. So datasets where documents take more than 50GB/200M = 268 bytes would rollover at 50GB while more space-efficient datasets would rollover at 200M documents. My motivation is the following:
- While shards could rollover before reaching 50GB, they would still not be small: 200M docs is not a small number of documents for a shard. As a data point, shards have a hard limit of 2B docs.
- Search performance depends more directly on the number of docs than on byte size, so having more control on the number of docs of a shard would give stronger guarantees on the sort of worst-case latency that shards can provide. This is especially relevant given recent/ongoing efforts to improve the space efficiency of Elasticsearch with runtime fields, doc-value-only fields or synthetic source.
- Aggregations have optimizations for the case when the range filter rewrites to a `match_all` query. Bounding the number of docs per primary shard makes it a bit more likely that some shards fully match the query, and the fact that shards that partially match the range filter have a bounded number of docs also helps bound tail latencies.
- Rollups can only operate or a rolled-over index, so users cannot enjoy the query speedup of rollups until the primary shard size is reached. Bounding the size of primary shards in a way that makes it easier to reason about query performance helps there too.
- Many use-cases for Elasticsearch involve `terms` or `composite` aggregations that need to build global ordinals under the hood. Global ordinals need to be built for the entire shard even though the query might only match a tiny part of it. Bounding the number of docs in a shard also helps bound the time it takes to build global ordinals.