Problem of node detachment from the cluster

Hi,
We have a 4-node Elasticsearch cluster which serves as the backend of Kibana for log analysis.
The current roles of the nodes in the cluster are as follows:
1 dedicated master node
3 data nodes (2 of which are also master-eligible)

We have a data size of about 7 TB spread across the 3 data nodes in 3,800 indices (7,820 shards).

Things run pretty smoothly, but sometimes during peak hours one of the nodes gets detached from the cluster and shard re-allocation takes place.

The message I receive in the logs of the affected node is as follows (node names and IPs redacted):

[2019-09-20T10:22:36,592][INFO ][o.e.c.c.JoinHelper ] [] failed to join {}{ml.machine_memory=48225767424, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [][1x.x1.4x.2x:9w0][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node.

Caused by: org.elasticsearch.transport.RemoteTransportException: [][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [32041430762/29.8gb], which is larger than the limit of [31431121305/29.2gb], real usage: [32035789032/29.8gb], new bytes reserved: [5641730/5.3mb].

Is this related to heap utilisation? Please suggest some measures for heap optimisation.

This is far too many shards for 3 data nodes. Each shard consumes some heap, so I think the best way for you to reduce heap usage is to reduce the number of shards in your cluster. Here is an article with more information:

It looks like you have ~30GB of heap on your data nodes, so you should aim to limit the number of shards to less than 600 per node, or 1800 in total.
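The arithmetic behind those numbers can be sketched out quickly. This is just a back-of-the-envelope calculation using the ~20-shards-per-GB-of-heap guideline mentioned in the thread, with the heap size and shard count taken from the posts above:

```python
# Back-of-the-envelope shard budget using the ~20 shards per GB of heap guideline.
heap_gb_per_node = 30        # ~30GB heap per data node, inferred from the log above
data_nodes = 3
shards_per_gb_heap = 20      # recommended ceiling, not a hard limit

per_node_budget = heap_gb_per_node * shards_per_gb_heap   # shards per data node
cluster_budget = per_node_budget * data_nodes             # shards in total

current_shards = 7820
print(f"budget: {cluster_budget} shards, current: {current_shards}")
print(f"over budget by a factor of {current_shards / cluster_budget:.1f}")
```

With these inputs the budget comes out at 600 shards per node (1,800 in total), so the cluster is carrying roughly 4x the recommended shard count.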

Hi David,
Thanks for the quick reply. Actually we create timestamp-based indices on a daily basis and then merge them on a weekly basis. We have about 130 different types of logs, so we cannot reduce the shard count beyond a certain extent.

One of the data nodes has about 2,500 shards and another has 2,300. The node with 2,500 shards has been running fine from the very beginning, but the node with 2,300 shards is causing problems. In terms of data size and doc counts, the two nodes are almost identical. What is your suggestion here?

Why can different log types not share indices?

I can only reiterate the guidance I shared above. The recommendation of ≤ 20 shards per 1GB of heap is not a hard-and-fast rule, but it is based on extensive experience. Given that you are running out of heap on one node you either need more nodes or fewer shards.

Also your 7820 shards only contain 7TB of data, which averages out as less than 1GB per shard. This is far smaller than the ~40GB per shard recommended in the article I shared above. You should be able to fit 7TB of data into ~200 shards by following that guideline. Have you considered using rollover instead of daily indices? This would let you target a much more efficient shard size straight away without needing to merge things on a weekly basis. You can automate rollover and other index optimisations using ILM too.
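As an illustration of the rollover suggestion above, an ILM policy can roll an index over once it reaches a target size rather than every day. The following is only a sketch: the policy name, retention period, and thresholds are made up for this example, and the JSON body would be sent to the cluster as `PUT _ilm/policy/<policy-name>`:

```python
import json

# Hypothetical ILM policy: roll over when the index reaches ~40GB or 30 days,
# whichever comes first, instead of creating a new index every day.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_size": "40gb",  # target shard size from the guidance above
                        "max_age": "30d"
                    }
                }
            },
            "delete": {
                "min_age": "90d",            # example retention period; adjust to taste
                "actions": {"delete": {}}
            }
        }
    }
}

# This body would be sent as: PUT _ilm/policy/logs-policy
print(json.dumps(policy, indent=2))
```

With a policy like this, each log type writes to a single rollover alias and new backing indices are only created when a shard actually fills up, which removes the need for the weekly merge step.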

Hi Christian,
We tried to merge a few properties together under one index, but then the indices become quite bulky, i.e. each index ranges from 10-20 GB with about 30-40 million docs.
I was confused about whether that would impact our search speed.
Please suggest best practices for dealing with heavy indices.

10-20GB is not very large. If you cannot merge, would it make sense to switch to weekly or monthly indices instead?

Hi David,
I thought over your suggestion of keeping multiple properties in one index to reduce the number of shards.
I was planning to group 8 properties, previously kept in 8 indices (16 shards), into 1 index with 4-5 shards. In that scenario a lot of write/index operations will occur on a single index. I read somewhere that keeping multiple shards for one index speeds up indexing.
Is that true? If yes, then wouldn't concentrating all the indexing on 1 index impact the overall indexing speed?

I would expect indexing into a smaller number of shards to be significantly more efficient than indexing into a larger number of shards, as there will, on average, be fewer fsyncs and write operations per document. Have you tested/measured/benchmarked this to find out for sure?

This was not my suggestion. I suggested using rollover (via ILM) to create new indices much less frequently so that each shard is a more efficient size (i.e. much larger).

You can sometimes improve indexing performance with more primary shards in the active index, but not always. If you can get acceptable performance with 1 shard then use that. If you do need more shards then you can also use ILM to shrink the number of shards once indexing is complete to give you the best of both worlds.
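The shrink step mentioned above can be expressed as a warm-phase action in the same ILM policy. This is only a sketch with assumed timings; the target shard count must evenly divide the index's original primary shard count, and the fragment would sit inside the `phases` object of a policy:

```python
import json

# Hypothetical warm phase: once an index has rolled over and is no longer
# being written to, shrink it to a single primary shard and force-merge it.
warm_phase = {
    "warm": {
        "min_age": "1d",   # assumed delay after rollover before shrinking
        "actions": {
            "shrink": {"number_of_shards": 1},   # must divide the original count
            "forcemerge": {"max_num_segments": 1}
        }
    }
}

print(json.dumps(warm_phase, indent=2))
```

This way the active index can keep whatever shard count gives acceptable write throughput, while older read-only indices are consolidated down to fewer, larger shards.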

But Christian is right, we have no reliable way to answer questions about the performance implications of any changes you make. You will need to run your own experiments.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.