Cluster health and the number of indices and amount of data


(Don Pich) #1

So I have 4 nodes in my ELK stack: a Logstash server that parses syslog messages into the cluster, plus 3 ES nodes. One is strictly a master node; the other two are master/data eligible.

So it's all working fine UNTIL I reach a 'magic point' in my data. At some point the cluster goes red and I am not able to do anything to fix it. This seems to happen after the data nodes end up with a large number of indexes. What I end up doing, after taking snapshots, is flushing all indexes that are 30 days and older; then the cluster goes green and the whole ELK stack starts functioning normally again.

I have the default shard count (5) and 1 replica.

My question is this: I need to keep 90 days of active data plus 1 year of logs (think PCI DSS). I have Curator taking snapshots of the data. My problem is that I can't get to 90 days with the raw amount of data.

Is this a sign of needing more shards? Do I need more data nodes?


(Mark Walkom) #2

How much data per day? What are your node specs?


(Christian Dahlqvist) #3

How many indices and shards do you have in the cluster? What is your total amount of data?


(Don Pich) #4

There is approximately 1 GB of log data per day, and that looks like it's going to expand.

All servers are set up as follows in VMware:
- 2 sockets, 4 cores (8 CPUs)
- 16 GB of RAM, 8 GB heap
- 500 GB of disk space


(Don Pich) #5

At this time, there are 5 indexes with 5 shards per index. We are capturing roughly 1 GB to 1.5 GB of information daily. Disk space on the partition that stores the indexes was at roughly 85% when I experienced the issue.


(Mark Walkom) #6

You should reduce your shard count to 1 primary with 1 replica, otherwise it's a waste.
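For daily time-based indices, this kind of change is usually applied with an index template so that every newly created index picks up the reduced shard count. A minimal sketch of such a template body, assuming a `logstash-*` index naming pattern and the legacy `PUT _template/<name>` API (the template name and pattern here are illustrative, not from the thread):

```python
import json

# Illustrative index template body. The "logstash-*" pattern and the
# template name are assumptions; adjust them to your own index naming.
# Applied via PUT _template/logstash, every newly created matching index
# would get 1 primary shard and 1 replica instead of the default 5 + 1.
template_body = {
    "template": "logstash-*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    },
}

# The JSON you would send as the request body:
print(json.dumps(template_body, indent=2))
```

Note that templates only affect indices created after the template is put in place; existing indices keep their shard count.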


(Don Pich) #7

So forgive the question, but how would that help the issue I'm having?


(Christian Dahlqvist) #8

If I calculate correctly, you have 5 indices * 5 shards * 2 copies * 90 days, which gives a total of 4500 shards. That is 2250 shards per data node with only 8GB of heap. I would guess this is most likely the cause of your problems.

Each shard is a Lucene index and carries overhead in terms of memory usage and file handles. By following Mark's advice and reducing the number of shards per index, and possibly also the number of indices, e.g. by using weekly or monthly indices, you can reduce the overhead and let your nodes handle more data. Having a large number of very small shards wastes system resources, so I would recommend aiming for shard sizes in the range of a few hundred MB to a few GB. Read this blog post for an example of how it is possible to crash an Elasticsearch cluster with too many shards even when no data has been indexed.
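The back-of-the-envelope shard math above can be sketched like this (the monthly-index figure at the end is an illustrative estimate, not a number from the thread):

```python
# Shard math for the cluster described in this thread.
indices_per_day = 5    # daily index patterns
shards_per_index = 5   # default number of primary shards
copies = 2             # 1 primary + 1 replica
retention_days = 90
data_nodes = 2

total_shards = indices_per_day * shards_per_index * copies * retention_days
print(total_shards)                # 4500
print(total_shards // data_nodes)  # 2250 shards per data node

# With 1 primary shard per index instead (Mark's suggestion):
reduced = indices_per_day * 1 * copies * retention_days
print(reduced)                     # 900

# Going further to monthly indices (~3 months cover 90 days), an
# illustrative estimate:
monthly = indices_per_day * 1 * copies * 3
print(monthly)                     # 30
```

Either change alone already cuts the per-node shard count by a large factor, which is what relieves the heap and file-handle pressure.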


(system) #9