When I started taking care of one of our clusters, I found that it is far too big:
Shards: 5
Nodes: 5
Replica: 1
Data per shard: ~400GB
Total documents: 5 billion
Version: 1.4
I tried to run a simple search on the cluster, but the query just ran until it timed out.
I also tried elasticdump from GitHub, but it got stuck while running, and the search latency in my monitoring spiked.
The data is time-based. Is there a way to archive it? Any ideas?
Query and aggregation latency depends on shard size, as each query/aggregation runs single-threaded against each shard. Multiple shards and queries can, however, be processed in parallel. Shards this large can therefore result in poor query performance.
If you have time-based data, we generally recommend that you use time-based indices.
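A minimal sketch of what that could look like with the Python client, assuming a monthly naming scheme like `logs-YYYY.MM`; the index name, field names and settings here are placeholders, and the exact client calls depend on your client version:

```python
from datetime import datetime
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch(["http://localhost:9200"])

# Hypothetical template so every new monthly index gets consistent settings.
# On 1.x the pattern key is "template"; newer versions use "index_patterns".
es.indices.put_template(
    name="logs-monthly",
    body={
        "template": "logs-*",
        "settings": {"number_of_shards": 5, "number_of_replicas": 1},
    },
)

# Write each document into the index for its month, e.g. logs-2015.06.
doc = {"@timestamp": datetime.utcnow().isoformat(), "message": "example event"}
es.index(index="logs-{:%Y.%m}".format(datetime.utcnow()), doc_type="event", body=doc)
```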
As you are on a very old version, I would also recommend upgrading.
Having said that, I don't think there is an easy way out: you will need to reindex your data into new indices. If even simple queries and scroll requests time out, that may however be difficult.
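Since 1.4 predates the `_reindex` API, one option is a client-side scan-and-scroll plus bulk copy, one month at a time. This is only a rough sketch with the Python client; the source/target index names, the `@timestamp` field and the batch sizes are assumptions you would adapt:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch(["http://localhost:9200"], timeout=120)

SOURCE_INDEX = "old-big-index"  # placeholder name

def month_actions(start, end, target_index):
    """Stream one month of documents out of the old index and re-target them."""
    # 1.x-style filtered query; on newer versions use a bool query with a filter clause.
    query = {
        "query": {
            "filtered": {
                "filter": {"range": {"@timestamp": {"gte": start, "lt": end}}}
            }
        }
    }
    for hit in scan(es, index=SOURCE_INDEX, query=query, scroll="10m", size=500):
        yield {
            "_index": target_index,
            "_type": hit["_type"],
            "_id": hit["_id"],
            "_source": hit["_source"],
        }

# Copy one month at a time so each scroll context stays reasonably short-lived.
bulk(es, month_actions("2015-06-01", "2015-07-01", "logs-2015.06"), chunk_size=500)
```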
I am using a filter to extract the data month by month, and that seems to work for my case.
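Roughly like this (sketch only; `@timestamp` and the index name are placeholders), and counting per month first helps me check the volumes before pulling anything out:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Monthly range filter (1.x filtered-query syntax).
monthly_query = {
    "query": {
        "filtered": {
            "filter": {
                "range": {"@timestamp": {"gte": "2015-06-01", "lt": "2015-07-01"}}
            }
        }
    }
}

# Count how many documents fall into the month before extracting them.
print(es.count(index="old-big-index", body=monthly_query)["count"])
```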
I am going to build new indices by month. By my calculation, if each monthly index has 5 shards, each shard will hold about 60GB of data and roughly 5 million documents.
Do you think that shard size is too big? What is the recommended shard size?
The ideal shard size depends on the use case, but we generally recommend keeping it below 50GB. You also do not have to stick with 5 shards per index; if the shards would otherwise get too large, 8 or 10 shards per monthly index may be more suitable.
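For example, you can size each monthly index up front when you create it. The numbers below are just the back-of-the-envelope kind from your own estimate (5 × 60GB ≈ 300GB per month); adjust them to your real volumes, and round the shard count up further if you want headroom for growth:

```python
import math
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

expected_month_gb = 300  # assumed monthly data volume
target_shard_gb = 50     # rule-of-thumb upper bound per shard
shards = max(1, int(math.ceil(expected_month_gb / float(target_shard_gb))))

# Create the monthly index with an explicit shard count; the shard count
# cannot be changed after index creation, so it has to be picked up front.
es.indices.create(
    index="logs-2015.06",
    body={"settings": {"number_of_shards": shards, "number_of_replicas": 1}},
)
```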