ES cluster fails at random times

Hi,

We have an Elasticsearch cluster with 2 client nodes (m4.large, 50 GB storage each) and 4 data nodes (m4.large, 1 TB storage each). We have been sending our RoR app logs and ELB logs to this cluster through Logstash for the last year. We recently started sending the CloudFront logs of our website to the same cluster, and since then the cluster has become unstable.

  • The data nodes are showing full heap usage (the default is 4 GB).
    In the client node logs I can see entries like this:

logs-elk-es-client-2] Received response for a request that has timed out, sent [16454ms] ago, timed out [1454ms] ago, action [cluster:monitor/nodes/stats[n]], node [{logs-elk-es-data-4}{XNO89o2ySnuo30Zr0Fhs-w}{10.10.89.69}{10.10.89.69:9300}{max_local_storage_nodes=1, master=false}], id [104567]
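
For reference, heap and load per node can be checked with the cat API (a minimal sketch, assuming the default HTTP port 9200 is reachable on one of the client nodes):

```
# Check heap usage, RAM and load average per node via the cat API
# (host and port are assumptions, adjust for your setup).
curl 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,heap.max,ram.percent,load'
```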

I tried restarting the node that shows the problem (data-4).

After that the cluster starts reallocating shards and turns green, but the same problem shows up again after a few hours or a day.

I couldn't figure out what makes the cluster fail yet.

Any help would be appreciated.

ES version : 2.3.4

Sounds like you have finally overloaded your cluster. You need more hardware :smiley: What do you mean by full heap usage? If you're near 100% then you need to add heap, and at 4 TB of data you probably want something like 16 GB or 30 GB.
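
On 2.x the heap is normally set with the ES_HEAP_SIZE environment variable. A minimal sketch, assuming a package install (file location varies by distro) and a box with enough RAM, since heap should stay at or below roughly half of physical memory:

```
# Sketch for ES 2.x: bump the heap via ES_HEAP_SIZE (keep it <= ~50% of RAM
# and below ~30GB). Path shown is for a DEB install; adjust for your setup.
echo 'ES_HEAP_SIZE=16g' | sudo tee -a /etc/default/elasticsearch
sudo service elasticsearch restart
```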

Do you have Marvel/Kopf installed? What are your load average and heap usage at the time it fails?

How many shards (and replicas) do you have? 1 TB per node is a lot (an m4.large only has 2 vCPUs, right?). You might consider upgrading to a bigger box or adding more servers. That is a lot of data for 4 nodes to process.
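
The cat APIs will give you those numbers straight away (assuming port 9200 is reachable on any node):

```
# Primary/replica counts and on-disk size per index.
curl 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'
# Per-shard view, including which node each shard lives on.
curl 'http://localhost:9200/_cat/shards?v'
```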

You have 4 data nodes and 2 client nodes; how many master nodes?

@eperry Hi, thanks for the reply.

I have solved the issue by adding two more data nodes of the same size. That reduced the heap usage on the other nodes. I'm attaching the kopf screenshot of my current cluster and also the cluster status when the issue happened. Please have a look and suggest whether we need to make any changes to the cluster for better performance.

The old cluster had 4 data nodes and 2 master nodes.

Now there are 6 data nodes and 2 master nodes.

You should always aim to have (at least) 3 master-eligible nodes in a cluster. With only 2 master-eligible nodes, both need to be up in order to be able to elect a master if minimum_master_nodes is correctly set to 2. With 3 master-eligible nodes, one of the nodes can go down and it is still possible for the remaining two to reach consensus and elect a master, allowing the cluster to continue operating.
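
For example, with 3 master-eligible nodes, minimum_master_nodes should be 2 (a majority). A sketch of setting it dynamically, assuming the cluster settings API is reachable on port 9200 (it can also go in elasticsearch.yml on every node as discovery.zen.minimum_master_nodes):

```
# Sketch: set minimum_master_nodes to a majority (2 of 3 master-eligible nodes).
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'
```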

Your cluster looks much like mine.

The only thing I can say in general is (just food for thought):

Let's talk simple math.

An index of 1 TB with 1 shard means only one CPU can act on the data when searching, so keeping the shard-to-index-to-CPU ratio right is important. The smaller the shards are, the faster a CPU can process the search.

I am currently working with a rule of thumb of roughly 2.5x the data node count, i.e. I have 13 data nodes so about 30 shards plus replication. I don't have the luxury to play with many configs, so your mileage may vary. I also have huge amounts of disk and memory.
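
As an illustration only (the template name, index pattern and counts below are made-up numbers for a 6-data-node cluster, not a recommendation), the shard count for new daily indices can be pinned with an index template:

```
# Sketch for ES 2.x: fix primaries/replicas for future logstash-* indices.
curl -XPUT 'http://localhost:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}'
```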

Keep an eye on your iowait.
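
For example with iostat from the sysstat package (or the "wa" column in top):

```
# Report per-device utilisation and CPU iowait every 5 seconds.
iostat -x 5
```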

Other than those words of wisdom, I can't think of much more.
