Hi
I have 3 nodes and 55TB of data to index. I can split it into 10 indices with 155 shards each, or into 110 indices with 10 shards each.
I don't know which option is better.
Can anyone help?
What is the use case? What kind of data do you have? How are you going to query/use it?
The data is log data. Most of the fields are structured and I only need to search on one field, but I want to use Kibana to draw different visualizations.
Each record is at most 500 bytes.
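To get a feel for the document counts involved, here is a rough sketch based only on the figures in this thread. It assumes decimal terabytes and that every record is at the 500-byte maximum, so the result is a lower bound on the number of records; the helper name `docs_per_shard` is made up for illustration.

```python
RAW_BYTES = 55 * 10**12    # 55TB, assuming decimal terabytes
MAX_RECORD_BYTES = 500     # upper bound per record, from the post above

# Lower bound on the record count: every record at the 500-byte maximum.
min_records = RAW_BYTES // MAX_RECORD_BYTES


def docs_per_shard(total_docs: int, total_shards: int) -> int:
    """Average documents per primary shard, rounded up."""
    return -(-total_docs // total_shards)


print(f"at least {min_records:,} records")
for shards in (10 * 155, 110 * 10):
    print(f"{shards} shards -> about {docs_per_shard(min_records, shards):,} docs per shard")
```

Smaller records would push the counts higher, so these numbers are only a starting point for testing.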
In that case the recommended best practice is to use time-based indices. Make sure you follow these guidelines on shard sizes and sharding practices. The following resources may also be useful:
https://www.elastic.co/webinars/optimizing-storage-efficiency-in-elasticsearch
https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right
https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
Having said that, having only 3 nodes for 55TB of raw data sounds a bit small, especially if you intend to have replicas in order to get high availability. I would however recommend running some tests to see how much data you can hold on your particular hardware.
Each node has 40 cores, 128GB RAM and 12 HDDs (6TB), configured as RAID 10 in three arrays.
I have read the documentation, but I still can't decide which option is better.
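As a back-of-the-envelope check of the capacity concern raised above, the following sketch compares usable disk space with the space the data might need. It rests on two assumptions that are not confirmed in this thread: that "(6TB)" is the capacity of each disk, and that the on-disk index is the same size as the raw data (`index_ratio=1.0` is a placeholder, the real ratio depends on mappings and compression and should be measured).

```python
DISKS_PER_NODE = 12
DISK_TB = 6          # assumption: "(6TB)" means 6TB per disk
NODES = 3
RAW_DATA_TB = 55


def usable_tb_per_node(disks: int, disk_tb: float) -> float:
    """RAID 10 mirrors every disk, so half of the raw capacity is usable."""
    return disks * disk_tb / 2


def required_tb(raw_tb: float, replicas: int, index_ratio: float = 1.0) -> float:
    """Space for primaries plus replicas.

    index_ratio is on-disk index size divided by raw size; 1.0 is a
    placeholder assumption, not a measured value.
    """
    return raw_tb * index_ratio * (1 + replicas)


cluster_tb = NODES * usable_tb_per_node(DISKS_PER_NODE, DISK_TB)
for replicas in (0, 1):
    need = required_tb(RAW_DATA_TB, replicas)
    verdict = "fits" if need <= cluster_tb else "does not fit"
    print(f"{replicas} replica(s): need {need:.0f}TB of {cluster_tb:.0f}TB usable -> {verdict}")
```

Under these assumptions the data fits without replicas but not with one replica, and that is before leaving any free-space headroom on the disks.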
What is not clear? Which options are you considering? Unsure how to apply time-based indices?
Sorry to ask again, I am not an expert in this (although I have read all the Elastic docs).
I don't know which way to go:
- Many indices, each with few shards.
- Few indices, each with many shards.
In both architectures, each shard is at most 30GB.
Thanks
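The two layouts can be compared with some simple arithmetic. This is only a sketch: it assumes decimal units and, as a placeholder, that the indexed data takes the same space as the 55TB of raw data, which real mappings and compression will change.

```python
RAW_GB = 55 * 1000     # 55TB in decimal GB; assumes index size == raw size
MAX_SHARD_GB = 30      # shard size cap mentioned above
NODES = 3


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


# Primary shards needed to keep every shard at or below the cap.
shards_needed = ceil_div(RAW_GB, MAX_SHARD_GB)
print(f"shards needed at {MAX_SHARD_GB}GB each: {shards_needed}")

layouts = {
    "10 indices x 155 shards": 10 * 155,
    "110 indices x 10 shards": 110 * 10,
}
for name, shards in layouts.items():
    avg_gb = RAW_GB / shards
    per_node = ceil_div(shards, NODES)
    print(f"{name}: {shards} shards, ~{avg_gb:.1f}GB per shard, ~{per_node} shards per node")
```

If the index really were as large as the raw data, both layouts would end up above 30GB per shard, so it is worth indexing a sample first to measure the actual size on disk.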
It depends on your data. How many different types of data? For each type, how much data do you have per day? What is the total time period covered by this data set?
All the data is of the same type and covers a 24-hour period. After indexing, no new data will be appended.
Then time-based indices may not be applicable. Try to align the indices with how you query the data. The fewer shards you need to query, the better performance I would expect. If you are always going to query the full data set, the total shard count may be more important than exactly how the shards are divided into indices.
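One way to align indices with queries, sketched here purely as an illustration, is to partition documents by a hash of the single field that is searched, so an exact-match lookup only needs to target one index. The function name `index_for`, the prefix `logs-part-` and the count of 110 are hypothetical choices, not something prescribed in this thread.

```python
import hashlib


def index_for(value: str, n_indices: int = 110, prefix: str = "logs-part-") -> str:
    """Map a value of the searched field to one index name (hypothetical scheme).

    A stable hash is used so the same value always lands in the same index,
    both at indexing time and at query time.
    """
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_indices
    return f"{prefix}{bucket:03d}"


# The same value always maps to the same index name.
print(index_for("example-value"))
```

This only helps exact-match searches on that one field. Kibana visualizations that aggregate over the whole data set would still touch every shard, which is where the total shard count matters.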