ES cluster fails at random times

imewish · November 28, 2016, 5:03pm

Hi,

We have an Elasticsearch cluster with 2 client nodes (m4.large, 50 GB storage each) and 4 data nodes (m4.large, 1 TB storage each). We have been sending our RoR app logs and ELB logs to this cluster through Logstash for the last year. We recently started sending the CloudFront logs for our website to the same cluster. Since then the cluster has become unstable.

The data nodes are showing full heap usage (default is 4GB).
In the client node logs I can see entries like this:

logs-elk-es-client-2] Received response for a request that has timed out, sent [16454ms] ago, timed out [1454ms] ago, action [cluster:monitor/nodes/stats[n]], node [{logs-elk-es-data-4}{XNO89o2ySnuo30Zr0Fhs-w}{10.10.89.69}{10.10.89.69:9300}{max_local_storage_nodes=1, master=false}], id [104567]

I tried restarting the node that shows the problem (data 4).

After that the cluster starts reallocating shards and turns green, but the same problem shows up again after a few hours or a day.

I haven't been able to find out what makes the cluster fail yet.

Any help would be appreciated.

ES version : 2.3.4

eperry · November 30, 2016, 5:54am

Sounds like you have finally overloaded your cluster and need more hardware. What do you mean by full heap usage? If you're near 100%, then you need to add heap, and at 4 TB of data you probably want something like 16GB or 30GB.

Do you have Marvel/Kopf installed? What is your load average (and heap) at the time of the failures?
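If Marvel or Kopf is not at hand, the heap question can also be answered from the node stats API (`GET _nodes/stats/jvm`), which reports `jvm.mem.heap_used_percent` per node. The sketch below only parses such a response; the function name, the 85% threshold, and the sample values are illustrative assumptions, not figures from this cluster.

```python
def nodes_over_heap_threshold(stats, threshold_percent=85):
    """Return (node name, heap %) pairs at or above the threshold, highest first.

    `stats` is the parsed JSON body of GET _nodes/stats/jvm.
    The 85% default is an assumed alert level, not an official limit.
    """
    hot = []
    for node_id, node in stats.get("nodes", {}).items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot.append((node.get("name", node_id), used))
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


# Made-up sample shaped like a node stats response (values are hypothetical).
sample_stats = {
    "nodes": {
        "id-a": {"name": "logs-elk-es-data-4",
                 "jvm": {"mem": {"heap_used_percent": 99}}},
        "id-b": {"name": "logs-elk-es-data-1",
                 "jvm": {"mem": {"heap_used_percent": 72}}},
        "id-c": {"name": "logs-elk-es-client-2",
                 "jvm": {"mem": {"heap_used_percent": 40}}},
    }
}

print(nodes_over_heap_threshold(sample_stats))
```

Running this periodically and logging the result alongside the load average would show whether heap pressure builds up before each failure.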

How many shards (and replicas) do you have? 1TB on each node is a lot. (How many CPUs does an m4 have? 4, right?) You might consider upgrading to a bigger box or adding more servers. That is a lot of data for 4 nodes to process.

You have 4 data nodes and 2 client nodes. How many master nodes?

imewish · November 30, 2016, 6:44am

@eperry Hi, thanks for the reply.

I solved the issue by adding two more data nodes of the same size, which reduced the heap usage on the other nodes. I'm attaching a Kopf screenshot of my current cluster and also of the cluster status when the issue happened. Please have a look and suggest any changes we should make to the cluster for better performance.

The old cluster had 4 data nodes and 2 master nodes.

Now there are 6 data nodes and 2 master nodes.

Christian_Dahlqvist · November 30, 2016, 7:38am

You should always aim to have (at least) 3 master-eligible nodes in a cluster. With only 2 master-eligible nodes, both need to be up in order to elect a master if minimum_master_nodes is correctly set to 2. With 3 master-eligible nodes, one of the nodes can go down and the remaining two can still reach consensus and elect a master, allowing the cluster to continue operating.
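The arithmetic behind this can be sketched in a few lines. The usual quorum formula for minimum_master_nodes is (master-eligible nodes / 2) + 1 with integer division; the function names below are illustrative, not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerable_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a master can still be elected."""
    return master_eligible - minimum_master_nodes(master_eligible)


for n in (2, 3):
    print(f"{n} master-eligible: quorum={minimum_master_nodes(n)}, "
          f"can lose {tolerable_failures(n)}")
```

With 2 master-eligible nodes the quorum is 2 and no node can be lost; with 3 the quorum is still 2, so one node can fail.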

eperry · December 1, 2016, 12:41am

Your cluster looks much like mine.

The only thing I can say in general is this (just food for thought).

Let's talk simple math.

An index of 1TB with 1 shard means only one CPU can act on the data when searching, so keeping the shard-to-index-to-CPU ratio in check is important. The smaller the shards are, the faster a CPU can process the search.

I am currently working with a rule of thumb of 2.5 shards per data node, i.e. I have 13 data nodes, so I have about 30 shards plus replication. I don't have the luxury of playing with many configs, so your mileage may vary. I also have huge amounts of disk and memory.
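As a rough sketch of that rule of thumb (an illustration only: the 2.5 factor is the personal ratio quoted above, and the replica count of 1 is an assumption, since the thread does not state it):

```python
def target_primary_shards(data_nodes: int, shards_per_node: float = 2.5) -> int:
    """Primary shard count from the 2.5-shards-per-data-node rule of thumb.

    Truncates toward zero so the result never exceeds the ratio.
    """
    return int(data_nodes * shards_per_node)


def total_shard_copies(primaries: int, replicas: int = 1) -> int:
    """Primaries plus their replica copies (replicas=1 is an assumed setting)."""
    return primaries * (1 + replicas)


for nodes in (13, 6):
    primaries = target_primary_shards(nodes)
    print(f"{nodes} data nodes: {primaries} primaries, "
          f"{total_shard_copies(primaries)} copies with 1 replica")
```

For 13 data nodes this gives 32 primaries, in line with the "about 30" figure; for the 6-node cluster in this thread it gives 15.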

Keep an eye on your iowait.

Other than those words of wisdom, I can't think of much more.
