We have to index and search ~15-17B small documents a day. Cluster consist of 24 data nodes (12 servers with 2 ES instances each = 24 data nodes). Each shard is around 30GB on disk (48 shards) and around 55GB (24 shards). Smaller shard number worse in terms or rebalance time and awful for recovery time.
As we have a problem with bulk queue (link) there was assumption is that index writer is single threaded for a shard and shard number should be closer to number of concurent bulk indexing threads. Another assumption was what we hit into some lucene limit in document count per shard so indexing latency increases.
As for refresh interval and other index settings:
"settings": {
"index": {
"codec": "best_compression",
"refresh_interval": "30s",
"bloom": { "load": "false" },
"number_of_shards": "48",
"number_of_replicas": "1",
"store": {
"throttle": { "type": "none" }
},
"merge": {
"policy": {
"reclaim_deletes_weight": "0.0"
}
},
"mapper": {
"dynamic": "false"
},
"ttl": {
"disable_purge": "true"
}
}
}
We use 2.1.2 (this guide a little outdated) and try different approaches and settings but always open to all advices.
As for 20K we are still try to fugure out best setting, it was set at 1.7.x and worked well. Now we will try no change it and see if there be any change.