Difference between Elasticsearch node down and only the Elasticsearch process down

Hello,

Short summary: I noticed this difference in Elasticsearch behaviour in 2 scenarios:

  1. If the cluster is green and the Elasticsearch process goes down on one node, the cluster starts reallocating and after some hours it returns to green by itself.
  2. If the cluster is green and the whole node goes down, the cluster does nothing to recover. Only after the node is up again does it recover.

I am fine with the behaviour in scenario 1, but I would like Elasticsearch to recover in scenario 2 as well.
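
In case it helps, these are the kinds of checks I mean when I say the cluster "notices" a node loss or "does nothing" (just a sketch; <IP1> is a placeholder for one of the nodes, as in the config further down):

# Overall cluster status and the number of nodes the master currently sees
curl -s 'http://<IP1>:9200/_cluster/health?pretty'

# Which nodes are currently part of the cluster and which one is the master
curl -s 'http://<IP1>:9200/_cat/nodes?v'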

I also read the docs about Zen discovery, and it all seems to be based on pings and IPs/hostnames, so this is probably behaviour by design. https://www.elastic.co/guide/en/elasticsearch/reference/2.3/modules-discovery-zen.html
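
As far as I understand those docs, how quickly the master declares a node dead is driven by the Zen fault-detection pings (discovery.zen.fd.ping_interval / ping_timeout / ping_retries, defaults 1s / 30s / 3). A way to double-check the effective values on a running cluster, assuming 6.x (sketch; <IP1> is again a placeholder):

# Show the effective Zen fault-detection settings among the cluster defaults
curl -s 'http://<IP1>:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep 'discovery.zen.fd'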

This question is partly about Elasticsearch and a bit about Graylog; maybe the community here can help me with the Elasticsearch part. I posted the question in the Graylog community, but got no answer there.

I am using Elasticsearch version 6.8 in combination with Graylog 3.3. There are 3 Graylog nodes and 7 Elasticsearch nodes, and all indices use replicas.

Over the weekend one of the Elasticsearch nodes went down completely, and we noticed the cluster turning red and messages no longer being transferred from Graylog to Elasticsearch.

I was trying to reproduce that scenario, testing how Graylog and Elasticsearch behave if I stop the Elasticsearch process on one of the nodes vs. what happens if I shut the whole node down. The difference was really interesting, and I could reproduce it several times.

  1. Stopping the Elasticsearch process on one node.
    The Elasticsearch status may turn yellow or even red for some time, but the cluster notices the loss of the node, starts recovering the missing data, and within a few hours it is up and running again.
[INFO ][o.e.c.s.MasterService    ] [bjFS-gc] zen-disco-node-left({sb3T5Jf}{sb3T5JflRji8bu8p_pHKAA}{KN7gCyp1TviphiokUsBtpg}{<IP NODE>}{<IP NODE>:9300}{ml.machine_memory=3973660672, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(left)
  2. Shutting down the same Elasticsearch node completely.
    Now the Elasticsearch status remains the same, RED, and nothing happens (see the shard checks right after this list).
    I found many Elasticsearch errors in the Graylog server logs, but no errors in the Elasticsearch logs.
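
For the record, this is how one could check which shards are stuck and why while the cluster sits in red (a sketch; the allocation-explain call without a body explains the first unassigned shard it finds):

# List shards together with the reason they are unassigned
curl -s 'http://<IP1>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'

# Ask the cluster why an unassigned shard cannot be allocated
curl -s 'http://<IP1>:9200/_cluster/allocation/explain?pretty'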

After half an hour, the Elasticsearch node is powered on again, but the Elasticsearch process is still down and there are no new entries in the Elasticsearch logs. Soon after the node is reachable again, Elasticsearch (or Graylog) notices it, starts processing messages again, and sends them towards Elasticsearch.
The cluster is still red most of the time, but given enough time it turns green again.
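
While it is catching up, the ongoing shard recoveries can be watched like this (sketch, same placeholder IP):

# Show only shard recoveries that are currently in progress
curl -s 'http://<IP1>:9200/_cat/recovery?v&active_only=true'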

Has anybody seen similar behaviour?

I am looking for a solution that would prevent the journal from filling up when an Elasticsearch node goes down, so that Graylog continues to run.

Here is the Elasticsearch config:

cluster.name: bigger_graylog
path.data: /cached_d1,/cached_d2,/cached_d3,/cached_d4,/cached_d5,/cached_d6,/cached_d7,/cached_d8,/cached_d9,/cached_d10
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: <IP1>
discovery.zen.ping.unicast.hosts: ["<IP1>:9300","<IP2>:9300","<IP3>:9300","<IP4>:9300","<IP5>:9300","<IP6>:9300","<IP7>:9300"]
discovery.zen.minimum_master_nodes: 4
path.repo: ["/bkp_mnt"]
xpack.monitoring.enabled: false
http.cors.enabled: true

Thanks in advance.

This sounds like the kind of problem that happens if you don't configure your TCP retransmission timeout as the docs recommend:
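
Roughly, the idea is to lower the Linux TCP retransmission limit so that a connection to a powered-off peer fails within seconds instead of many minutes. A minimal sketch, assuming a Linux host (net.ipv4.tcp_retries2=5 is the commonly quoted value; the sysctl.d file name below is just an example):

# Apply immediately (the default of 15 retries can take well over 10 minutes to give up)
sysctl -w net.ipv4.tcp_retries2=5

# Persist across reboots
echo "net.ipv4.tcp_retries2=5" > /etc/sysctl.d/90-elasticsearch-tcp.conf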
