Nodes keep disconnecting and reconnecting within a few seconds

Hi There,

We can see a few nodes disconnecting and then reconnecting within a few seconds. The logs below show what we see on those nodes.

Because of this, shards become unassigned and have to be reassigned, which takes quite some time.

Could you please help us understand why we are seeing these disconnections? Is it something with Elasticsearch, or with the network?

Also, we mostly observe this on 2 of our 4 data nodes.

[2017-02-01 05:58:20,196][INFO ][discovery.zen ] [ITTESPROD-DATA3] master_left [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-01 05:58:20,196][WARN ][discovery.zen ] [ITTESPROD-DATA3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ITTESPROD-MSTR2}{WtSucYmHRUCUX_Ld7R7fBA}{10.158.36.202}{10.158.36.202:9300}{data=false, master=true},{ITTESPROD-CLIENT2}{0fDvB6DMSgCRDfCXxE0TBg}{10.158.36.209}{10.158.36.209:9300}{data=false, master=false},{ITTESPROD-DATA2}{vJzgp0a0Q-WXnAyFKdHcKw}{10.158.36.212}{10.158.36.212:9300}{master=false},{ITTESPROD-DATA1}{XNRK5gWBR2SnyIvD8Wnz6w}{10.158.36.211}{10.158.36.211:9300}{master=false},{ITTESPROD-DATA3}{7m7OdCyORaKSsGEXduh55g}{10.158.36.204}{10.158.36.204:9300}{master=false},{ITTESPROD-MSTR1}{fnPHHE1xREaNnz4kA6rrSA}{10.158.36.201}{10.158.36.201:9300}{data=false, master=true},{ITTESPROD-CLIENT1}{lQ5OG-qySBmnLpshqTxxfQ}{10.158.36.199}{10.158.36.199:9300}{data=false, master=false},{ITTESPROD-CLIENT0}{cGF5_yyrRZOh9D0IyqgD3Q}{10.158.36.220}{10.158.36.220:9300}{data=false, master=false},{ITTESPROD-DATA4}{3IaCVr0JROqQT40o81Oyjw}{10.158.36.208}{10.158.36.208:9300}{master=false},}
[2017-02-01 05:58:20,196][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true})
[2017-02-01 05:58:24,823][INFO ][cluster.service ] [ITTESPROD-DATA3] detected_master {ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}, added {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-02-01 05:58:25,793][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [4404297]
.....
[2017-02-01 06:30:01,062][INFO ][discovery.zen ] [ITTESPROD-DATA3] master_left [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-01 06:30:01,062][WARN ][discovery.zen ] [ITTESPROD-DATA3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ITTESPROD-DATA1}{XNRK5gWBR2SnyIvD8Wnz6w}{10.158.36.211}{10.158.36.211:9300}{master=false},{ITTESPROD-CLIENT1}{lQ5OG-qySBmnLpshqTxxfQ}{10.158.36.199}{10.158.36.199:9300}{data=false, master=false},{ITTESPROD-CLIENT2}{0fDvB6DMSgCRDfCXxE0TBg}{10.158.36.209}{10.158.36.209:9300}{data=false, master=false},{ITTESPROD-DATA4}{3IaCVr0JROqQT40o81Oyjw}{10.158.36.208}{10.158.36.208:9300}{master=false},{ITTESPROD-MSTR2}{WtSucYmHRUCUX_Ld7R7fBA}{10.158.36.202}{10.158.36.202:9300}{data=false, master=true},{ITTESPROD-MSTR1}{fnPHHE1xREaNnz4kA6rrSA}{10.158.36.201}{10.158.36.201:9300}{data=false, master=true},{ITTESPROD-DATA3}{7m7OdCyORaKSsGEXduh55g}{10.158.36.204}{10.158.36.204:9300}{master=false},{ITTESPROD-DATA2}{vJzgp0a0Q-WXnAyFKdHcKw}{10.158.36.212}{10.158.36.212:9300}{master=false},{ITTESPROD-CLIENT0}{cGF5_yyrRZOh9D0IyqgD3Q}{10.158.36.220}{10.158.36.220:9300}{data=false, master=false},}
[2017-02-01 06:30:01,062][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true})
[2017-02-01 06:30:05,709][INFO ][cluster.service ] [ITTESPROD-DATA3] detected_master {ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}, added {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-02-01 06:30:06,824][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [4406570]
...
[2017-02-01 06:37:23,661][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][168550][2873] duration [1.4s], collections [1]/[2.3s], total [1.4s]/[28m], memory [25.6gb]->[18.9gb]/[27gb], all_pools {[young] [7.1gb]->[72.6mb]/[7.4gb]}{[survivor] [897.2mb]->[955.6mb]/[955.6mb]}{[old] [17.6gb]->[17.9gb]/[18.6gb]}
[2017-02-01 06:38:15,755][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][168578][2874] duration [6.6s], collections [1]/[24.6s], total [6.6s]/[28.1m], memory [26.1gb]->[3.1gb]/[27gb], all_pools {[young] [7.3gb]->[77.2mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [17.9gb]->[3gb]/[18.6gb]}
[2017-02-01 06:38:15,755][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][old][168578][102] duration [17.2s], collections [1]/[24.6s], total [17.2s]/[2.1h], memory [26.1gb]->[3.1gb]/[27gb], all_pools {[young] [7.3gb]->[77.2mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [17.9gb]->[3gb]/[18.6gb]}

GC is taking a lot of time because your memory is under pressure.

That can explain why nodes don't answer pings.
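
To see that pressure directly, without installing anything, the nodes stats API is usually enough. A minimal sketch, assuming you can reach a node over HTTP on port 9200 (one of your client nodes, for instance):

```
# Quick per-node overview, including heap.percent.
curl -s 'http://localhost:9200/_cat/nodes?v'

# Detailed heap usage and GC counts/times for every node.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
```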

@dadoonet Could you please explain what to check and why GC is taking so long? What are the common causes of this issue?

What should we monitor to avoid this?

Here is what our data node config looks like:

cluster.name: ittesprod
node.name: ITTESPROD-DATA1
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212"]
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Why is GC taking so long?

Not really longer than usual. It's just taking time because you apparently have around 30GB of heap, and an old GC on 30GB can take up to 2 minutes, I believe.
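
If you want more detail than the monitor.jvm warnings, you could also enable GC logging on the data nodes. A sketch, assuming your startup script honours ES_JAVA_OPTS; the log path is only an example:

```
# Hypothetical example: write detailed GC logs so long pauses are easy to correlate with the disconnects.
export ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/elasticsearch/gc.log"
bin/elasticsearch -d
```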

What are the common causes of this issue?

BTW, you did not tell us your versions (Java and ES).

Apparently you changed some default settings, such as threadpool.bulk.queue_size: 1000, so I assume you understood the consequences of doing that.
I'm not sure why you adjusted those knobs, but look at what your monitoring says. Check the bulk queue usage: is it full, or almost full? A large queue can hold a lot of memory that then has to be reclaimed.
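
For example, you can watch the bulk queue and rejections per node with the cat API; a sketch, assuming HTTP access to the cluster on port 9200:

```
# Per-node bulk/index/search thread pools: active threads, queue depth, rejections.
curl -s 'http://localhost:9200/_cat/thread_pool?v'
```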

Maybe you are using fielddata for sorting or aggregations?
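
If you are not sure, the cat API can show that too; again just a sketch, assuming HTTP access on port 9200:

```
# Fielddata memory usage per node, broken down by field.
curl -s 'http://localhost:9200/_cat/fielddata?v'
```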

Did you install Marvel/X-Pack to monitor your cluster? That might tell you something.

I have formatted the code now.

Not really longer than usual. It's just taking time because you apparently have around 30GB of heap, and an old GC on 30GB can take up to 2 minutes, I believe.

So, can I increase the ping timeout to 1 minute to avoid the node disconnections? Would that be a good option?
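
Something like this in elasticsearch.yml is what I have in mind (just a sketch based on my reading of the zen fault detection settings; I have not tested it):

```
# Untested sketch: make fault detection more tolerant of long GC pauses.
discovery.zen.fd.ping_timeout: 60s   # default is 30s, matching the "maximum [30s] timeout" in our logs
discovery.zen.fd.ping_retries: 3     # default
```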

BTW, you did not tell us your versions (Java and ES).

Java 1.8.0_111 and ES 2.4.0

Apparently you changed some default settings, such as threadpool.bulk.queue_size: 1000, so I assume you understood the consequences of doing that.
I'm not sure why you adjusted those knobs, but look at what your monitoring says. Check the bulk queue usage: is it full, or almost full? A large queue can hold a lot of memory that then has to be reclaimed.

Agreed, but our bulk requests were getting rejected, which is why we increased it. Currently I can see a queue of about 200 on one node; all the other nodes are in single digits.

We are not using any fielddata: Field Size, Field Evictions, and Cache are continuously 0.

We installed HQ and Head for monitoring, as we cannot afford paid plugins right now. Please suggest any free, open-source plugins we could use.

I just realised that you also opened:

Let's keep the discussion there then. No need to open a new thread on the same problem IMO.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.