I would like to ask how to organize data in the cluster so that optimal performance and stability can be achieved.
My use case:
Every day the cluster ingests around 10 GB of documents, and it now contains around 3 TB of data (measured as disk usage). Metrics are later calculated on those documents using scripted metric aggregations. The cluster is queried quite rarely; let's say it handles around 1,000 requests per day. As time goes by the stored data size keeps growing, and eventually I will have to delete some of that data.
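For context, below is a minimal sketch of the kind of scripted metric aggregation involved; the index name, field name, and Groovy scripts are illustrative placeholders rather than the real production scripts, and it assumes inline Groovy scripting is allowed on the cluster.

```
import json
import requests

# Placeholder index ("client-a") and field ("value"); the real scripts are more involved.
query = {
    "size": 0,
    "aggs": {
        "total_value": {
            "scripted_metric": {
                "init_script": "_agg['sum'] = 0",
                "map_script": "_agg['sum'] += doc['value'].value",
                "combine_script": "return _agg['sum']",
                "reduce_script": "double total = 0; for (s in _aggs) { total += s }; return total"
            }
        }
    }
}

resp = requests.post("http://localhost:9200/client-a/_search", data=json.dumps(query))
print(resp.json()["aggregations"]["total_value"]["value"])
```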
Problem:
The stored data belongs to clients that have subclients. In the cluster, indices represent clients and document types represent subclients. All documents share the same template, so the indices are not time-based. Deleting data from the cluster with this layout would be very inefficient, so I am considering switching to indexing into time-based indices (maybe weekly or monthly). Here comes another problem: the cluster currently has around 700 indices, so there are 3,500 primary shards (5 per index) and 3,500 replica shards (1 replica per primary). The problem is that I don't know how the cluster will be impacted as the shard count increases, and I have read that having many shards is bad for a cluster.
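To make the retention question concrete, here is a rough sketch of what data deletion would look like with time-based indices, assuming a hypothetical naming scheme like <client>-YYYY.MM. Dropping a whole index is cheap, whereas deleting individual documents out of the current per-client indices forces expensive delete-and-merge work.

```
import requests

ES = "http://localhost:9200"

def drop_month(client, year, month):
    """Drop one month of one client's data by deleting the whole index."""
    index = "{0}-{1}.{2:02d}".format(client, year, month)   # e.g. client-a-2016.01
    return requests.delete("{0}/{1}".format(ES, index))

# Example: remove data that has aged out of a 12-month retention window.
drop_month("client-a", 2015, 1)
```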
Cluster setup:
version used: 2.2
node count: 8
master/data nodes: 3
data nodes: 5
primary shards per index: 5
replicas per primary: 1
Any advice on this situation or on the general cluster setup is highly appreciated. Feel free to ask for any additional information.
Switching to time-based indices will make it a lot more efficient to manage data retention. 3 TB across 7,000 shards means that your average shard size is very small. At that scale there is no need for each index to have 5 primary shards. If you were on version 5.x I would recommend using the _shrink API to reduce the number of primary shards from 5 to 1 for all indices, but as you are on Elasticsearch 2.2 you will probably need to reindex. If you have 700 clients and set up a monthly index with a single primary shard for each of them, you would get 16,800 shards per year (700 clients × 12 monthly indices × 1 primary shard × 2 copies, counting the replica), which is a lot for a cluster with only 5 data nodes. You may therefore want to have each time-based index cover a longer time period, e.g. 3 or 6 months, or maybe even 1 year if you have a very long retention period.
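For anyone on 5.x reading this later, the _shrink flow looks roughly like the sketch below (it is not available on 2.2). The node name and index names are placeholders; before the shrink step, a copy of every shard has to be relocated onto a single node and the index has to be write-blocked.

```
import json
import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}
SOURCE = "client-a-2016.01"       # existing 5-shard index (placeholder name)
TARGET = SOURCE + "-shrunk"       # new single-shard index

# 1. Move one copy of every shard to a single node and block writes.
prep = {
    "settings": {
        "index.routing.allocation.require._name": "node-1",  # placeholder node name
        "index.blocks.write": True
    }
}
requests.put("{0}/{1}/_settings".format(ES, SOURCE), data=json.dumps(prep), headers=HEADERS)

# 2. Once relocation has finished, shrink down to a single primary shard.
shrink = {
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
    }
}
requests.post("{0}/{1}/_shrink/{2}".format(ES, SOURCE, TARGET), data=json.dumps(shrink), headers=HEADERS)
```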
What is important is the average shard size, not necessarily the index size. How long a time period do these indices cover? What is your retention period?
The average shard size (measured using _cat/shards) is 1 GB. I also took a look at the top 50 shards by size, and they range from 15 GB to 50 GB. The indices are about a year old. It is kind of silly, but the company does not have an established data retention policy yet, so until then I just add a new data node whenever I see that the cluster is running out of memory. Also, a correction about the cluster setup: there are 3 master/data nodes and 5 dedicated data nodes.
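For reference, this is roughly how the average was calculated, using the _cat/shards API with bytes=b to get raw store sizes (the host is a placeholder):

```
import requests

ES = "http://localhost:9200"

# h= picks the columns, bytes=b returns the store size as a raw byte count.
resp = requests.get("{0}/_cat/shards?h=index,prirep,store&bytes=b".format(ES))

sizes = []
for line in resp.text.splitlines():
    parts = line.split()
    if len(parts) == 3 and parts[2].isdigit():   # unassigned shards have no store size
        sizes.append(int(parts[2]))

print("{0} shards, average size {1:.2f} GB".format(
    len(sizes), sum(sizes) / float(len(sizes)) / 1024 ** 3))
```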
Without knowing much about your data and use case, I would recommend aiming for shard sizes of between a few GB and a few tens of GB. This keeps the number of shards on each node down and will allow you to host more data per node without incurring too much overhead.