Multi Node - Same Cluster, Same Hosts

Hello,

On our production platform we have heap trouble which obliges us to restart our cluster roughly every 10 days
(the heap of each node climbs to 100%, causing timeouts across the whole cluster, etc.).

Data: 33 TB, ~700 indices, 1 replica.

We don't know why we have this heap trouble (nothing very relevant in the logs...).
We have 3 master nodes and 6 data nodes on Elasticsearch 5.5.0, each configured with a 31 GB heap.
As our servers have 256 GB of RAM, I think we can configure ES to run 2 data nodes on each data server, even though I see it's not recommended for a production platform (using the setting node.max_local_storage_nodes: 2).

1st: is it a good thing to do? What are the recommended configurations for that? (I didn't see anything in the Elastic documentation on that point.)
2nd: how does shard allocation work in that case? (Are we sure a primary shard and its replica will never end up on 2 nodes of the same host/server?)

FYI: I tested a configuration on a VM, with 2 systemd units using different config paths (/opt/application/elasticsearch/current/config/node01 and node02) and the max local storage parameter, but I'm not sure it's the best thing to do in production.
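
To make the idea concrete, the split I'm testing looks roughly like the sketch below (only the two config paths are the real ones; cluster name, node names, data paths and port numbers are placeholders):

    # /opt/application/elasticsearch/current/config/node01/elasticsearch.yml
    cluster.name: my-cluster                   # placeholder
    node.name: ${HOSTNAME}-node01
    node.master: false
    node.data: true
    node.max_local_storage_nodes: 2            # allow 2 nodes on this host
    path.data: /data/elasticsearch/node01      # placeholder data path
    http.port: 9201
    transport.tcp.port: 9301

    # /opt/application/elasticsearch/current/config/node02/elasticsearch.yml
    # identical except for:
    node.name: ${HOSTNAME}-node02
    path.data: /data/elasticsearch/node02      # placeholder data path
    http.port: 9202
    transport.tcp.port: 9302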

Best regards.

  1. It does mean you can use existing resources more efficiently
  2. See Shard Allocation Awareness | Elasticsearch Reference [5.5] | Elastic
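
As a minimal sketch for 5.5, assuming you tag every Elasticsearch process with the physical server it runs on (the attribute name server_id and its values are just examples):

    # on both nodes of a given physical server
    node.attr.server_id: server-a

    # in every elasticsearch.yml: spread shard copies across servers, not just nodes
    cluster.routing.allocation.awareness.attributes: server_id

    # and/or: never allocate two copies of the same shard on the same host
    cluster.routing.allocation.same_shard.host: true

With cluster.routing.allocation.same_shard.host set, a primary and its replica will not be placed on the two nodes that share one server, which is the situation your second question is about.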

What version are you on? How many shards?
What monitoring do you have in place?

Hello, and first of all thank you for answering this topic :slight_smile:

Version of the cluster: 5.5.0
3 master nodes, 6 data nodes, heap configured at 31 GB

Concerning data: 34.41 TB, 746 indices, 8668 shards, 19 billion docs.
When one node seems to be at 100% heap, the logs only show timeouts, and all our Kibana instances time out as well (to test cluster health I believe Kibana makes a _nodes/status call, so that's logical).
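
(For anyone following along, the per-node heap can be checked with something like the request below; the host and port are placeholders for one of our workers.)

    curl -s 'http://localhost:9201/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'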

We have several projects which inject data through an nginx reverse proxy that round-robins across all worker nodes.
All worker nodes also run nginx on port 9200, which handles our user authentication + projects backend and then forwards to the worker's HTTP port, 9201 (and 9202 if we add a new node).
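
Simplified, the front proxy is along these lines (hostnames and the listen port are placeholders; the real config also handles authentication):

    # inside the http {} block of nginx.conf on the front reverse proxy
    upstream es_workers {
        # nginx round-robins across these by default; each worker's local
        # nginx on 9200 then forwards to Elasticsearch on 9201
        server worker01:9200;
        server worker02:9200;
        server worker03:9200;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://es_workers;
        }
    }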

The heap percentage of each worker climbs quite fast, which forces us to restart the cluster approximately every 10 days...

If you have any comments on it, please don't hesitate

That is problematic given you only have 6 data nodes. You should look at using _shrink to reduce the shard count substantially; by half would be ideal.
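
As a rough sketch of the 5.5 shrink workflow (the index and node names are just examples, and the new primary count has to be a factor of the old one):

    # 1) block writes and move one copy of every shard onto a single node
    curl -XPUT 'http://localhost:9200/logs-2017.09/_settings' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "index.routing.allocation.require._name": "worker-node-01",
        "index.blocks.write": true
      }
    }'

    # 2) once relocation is done and the index is green, shrink to fewer primaries
    curl -XPOST 'http://localhost:9200/logs-2017.09/_shrink/logs-2017.09-shrunk' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
      }
    }'

Once the shrunken index is green you can drop the original and, if clients still use the old name, point an alias at the new index.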


Hmm, I can't do that; in fact we have several users with several projects, and I can't shrink their data... :frowning:

Why not? It has no impact on the data.


The data is not impacted at all, just how it's distributed.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.