Cluster health and the number of indices and amount of data


(Don Pich) #1

So I have 4 nodes in my ELK stack: a Logstash server that parses syslog messages into the cluster, plus 3 ES nodes. One is strictly a master node; the other two are master/data eligible.

So it's all working fine UNTIL I reach a 'magic point' in my data. At some point the cluster goes red and I am not able to do anything to fix it. This seems to happen after the data nodes end up with a large number of indexes. What I end up doing, after taking snapshots, is flushing all indexes that are 30 days and older; then the cluster goes green and the whole ELK stack starts functioning normally again.

I have the default shard count (5) and 1 replica.

My question is this: I need to keep 90 days of active data plus 1 year of logs (think PCI DSS). I have Curator taking snapshots of the data. My problem is that I can't get to 90 days with the raw amount of data.

Is this a sign of needing more shards? Do I need more data nodes?


(Mark Walkom) #2

How much data per day? What are your node specs?


(Christian Dahlqvist) #3

How many indices and shards do you have in the cluster? What is your total amount of data?


(Don Pich) #4

There is approximately 1 GB of log data per day, and that looks like it's going to expand.

All servers are set up as follows in VMware:
- 2 sockets, 4 cores (8 CPUs)
- 16 GB of RAM, 8 GB heap
- 500 GB of disk space


(Don Pich) #5

At this time, there are 5 indexes with 5 shards per index. We are capturing roughly 1 GB to 1.5 GB of information daily. Disk space on the partition that stores the indexes was at roughly 85% when I experienced the issue.


(Mark Walkom) #6

You should reduce your shard count to 1 primary with 1 replica, otherwise it's a waste.
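For daily time-based indices, this kind of change is usually applied with an index template so that every newly created index picks up the reduced shard count. A minimal sketch of such a template body, assuming a `logstash-*` index naming pattern and the legacy `PUT _template/<name>` API (the template name and pattern here are illustrative, not from the thread):

```python
import json

# Illustrative index template body. The "logstash-*" pattern and the
# template name are assumptions; adjust them to your own index naming.
# Applied via PUT _template/logstash, every newly created matching index
# would get 1 primary shard and 1 replica instead of the default 5 + 1.
template_body = {
    "template": "logstash-*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    },
}

# The JSON you would send as the request body:
print(json.dumps(template_body, indent=2))
```

Note that templates only affect indices created after the template is put in place; existing indices keep their shard count.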


(Don Pich) #7

So forgive the question, but how would that help the issue I'm having?


(Christian Dahlqvist) #8

If I calculate correctly, you have 5 indices * 5 shards * 2 copies * 90 days, which gives a total of 4500 shards. That is 2250 shards per data node with only 8GB of heap. I would guess this is most likely the cause of your problems.

Each shard is a Lucene index and carries overhead in terms of memory usage and file handles. By following Mark's advice and reducing the number of shards per index, and possibly also the number of indices, e.g. by using weekly or monthly indices, you can reduce the overhead and let your nodes handle more data. Having a large number of very small shards wastes system resources, so I would recommend aiming for shard sizes in the range of a few hundred MB to a few GB. Read this blog post for an example of how it is possible to crash an Elasticsearch cluster with too many shards even when no data has been indexed.
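The back-of-the-envelope shard math above can be sketched like this (the monthly-index figure at the end is an illustrative estimate, not a number from the thread):

```python
# Shard math for the cluster described in this thread.
indices_per_day = 5    # daily index patterns
shards_per_index = 5   # default number of primary shards
copies = 2             # 1 primary + 1 replica
retention_days = 90
data_nodes = 2

total_shards = indices_per_day * shards_per_index * copies * retention_days
print(total_shards)                # 4500
print(total_shards // data_nodes)  # 2250 shards per data node

# With 1 primary shard per index instead (Mark's suggestion):
reduced = indices_per_day * 1 * copies * retention_days
print(reduced)                     # 900

# Going further to monthly indices (~3 months cover 90 days), an
# illustrative estimate:
monthly = indices_per_day * 1 * copies * 3
print(monthly)                     # 30
```

Either change alone already cuts the per-node shard count by a large factor, which is what relieves the heap and file-handle pressure.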


(system) #9