Nodes keep disconnecting and reconnecting within a few seconds

Hi There,

We can see a few nodes disconnecting and then reconnecting within a few seconds. The logs below show what we see on those nodes.

Because of this, shards become unassigned and have to be reassigned, which takes quite some time.

Could you please help us understand why we are seeing these disconnections? Is it something with Elasticsearch, or with the network?

Also, we mostly observe this on 2 of our 4 data nodes.

[2017-02-01 05:58:20,196][INFO ][discovery.zen ] [ITTESPROD-DATA3] master_left [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-01 05:58:20,196][WARN ][discovery.zen ] [ITTESPROD-DATA3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ITTESPROD-MSTR2}{WtSucYmHRUCUX_Ld7R7fBA}{10.158.36.202}{10.158.36.202:9300}{data=false, master=true},{ITTESPROD-CLIENT2}{0fDvB6DMSgCRDfCXxE0TBg}{10.158.36.209}{10.158.36.209:9300}{data=false, master=false},{ITTESPROD-DATA2}{vJzgp0a0Q-WXnAyFKdHcKw}{10.158.36.212}{10.158.36.212:9300}{master=false},{ITTESPROD-DATA1}{XNRK5gWBR2SnyIvD8Wnz6w}{10.158.36.211}{10.158.36.211:9300}{master=false},{ITTESPROD-DATA3}{7m7OdCyORaKSsGEXduh55g}{10.158.36.204}{10.158.36.204:9300}{master=false},{ITTESPROD-MSTR1}{fnPHHE1xREaNnz4kA6rrSA}{10.158.36.201}{10.158.36.201:9300}{data=false, master=true},{ITTESPROD-CLIENT1}{lQ5OG-qySBmnLpshqTxxfQ}{10.158.36.199}{10.158.36.199:9300}{data=false, master=false},{ITTESPROD-CLIENT0}{cGF5_yyrRZOh9D0IyqgD3Q}{10.158.36.220}{10.158.36.220:9300}{data=false, master=false},{ITTESPROD-DATA4}{3IaCVr0JROqQT40o81Oyjw}{10.158.36.208}{10.158.36.208:9300}{master=false},}
[2017-02-01 05:58:20,196][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true})
[2017-02-01 05:58:24,823][INFO ][cluster.service ] [ITTESPROD-DATA3] detected_master {ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}, added {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-02-01 05:58:25,793][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [4404297]
.....
[2017-02-01 06:30:01,062][INFO ][discovery.zen ] [ITTESPROD-DATA3] master_left [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-01 06:30:01,062][WARN ][discovery.zen ] [ITTESPROD-DATA3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ITTESPROD-DATA1}{XNRK5gWBR2SnyIvD8Wnz6w}{10.158.36.211}{10.158.36.211:9300}{master=false},{ITTESPROD-CLIENT1}{lQ5OG-qySBmnLpshqTxxfQ}{10.158.36.199}{10.158.36.199:9300}{data=false, master=false},{ITTESPROD-CLIENT2}{0fDvB6DMSgCRDfCXxE0TBg}{10.158.36.209}{10.158.36.209:9300}{data=false, master=false},{ITTESPROD-DATA4}{3IaCVr0JROqQT40o81Oyjw}{10.158.36.208}{10.158.36.208:9300}{master=false},{ITTESPROD-MSTR2}{WtSucYmHRUCUX_Ld7R7fBA}{10.158.36.202}{10.158.36.202:9300}{data=false, master=true},{ITTESPROD-MSTR1}{fnPHHE1xREaNnz4kA6rrSA}{10.158.36.201}{10.158.36.201:9300}{data=false, master=true},{ITTESPROD-DATA3}{7m7OdCyORaKSsGEXduh55g}{10.158.36.204}{10.158.36.204:9300}{master=false},{ITTESPROD-DATA2}{vJzgp0a0Q-WXnAyFKdHcKw}{10.158.36.212}{10.158.36.212:9300}{master=false},{ITTESPROD-CLIENT0}{cGF5_yyrRZOh9D0IyqgD3Q}{10.158.36.220}{10.158.36.220:9300}{data=false, master=false},}
[2017-02-01 06:30:01,062][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true})
[2017-02-01 06:30:05,709][INFO ][cluster.service ] [ITTESPROD-DATA3] detected_master {ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}, added {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-02-01 06:30:06,824][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [4406570]
...
[2017-02-01 06:37:23,661][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][168550][2873] duration [1.4s], collections [1]/[2.3s], total [1.4s]/[28m], memory [25.6gb]->[18.9gb]/[27gb], all_pools {[young] [7.1gb]->[72.6mb]/[7.4gb]}{[survivor] [897.2mb]->[955.6mb]/[955.6mb]}{[old] [17.6gb]->[17.9gb]/[18.6gb]}
[2017-02-01 06:38:15,755][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][168578][2874] duration [6.6s], collections [1]/[24.6s], total [6.6s]/[28.1m], memory [26.1gb]->[3.1gb]/[27gb], all_pools {[young] [7.3gb]->[77.2mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [17.9gb]->[3gb]/[18.6gb]}
[2017-02-01 06:38:15,755][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][old][168578][102] duration [17.2s], collections [1]/[24.6s], total [17.2s]/[2.1h], memory [26.1gb]->[3.1gb]/[27gb], all_pools {[young] [7.3gb]->[77.2mb]/[7.4gb]}{[survivor] [955.6mb]->[0b]/[955.6mb]}{[old] [17.9gb]->[3gb]/[18.6gb]}

GC is taking a lot of time because your memory is under pressure.

That can explain why nodes don't answer pings.
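
To see that pressure directly, without installing anything, the nodes stats API is usually enough. A minimal sketch, assuming you can reach a node over HTTP on port 9200 (one of your client nodes, for instance):

```
# Quick per-node overview, including heap.percent.
curl -s 'http://localhost:9200/_cat/nodes?v'

# Detailed heap usage and GC counts/times for every node.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
```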

@dadoonet Could you please explain what to check and why GC is taking so long? What are the common causes of this issue?

What should we monitor to avoid this?

Here is what our data node config looks like:

cluster.name: ittesprod
node.name: ITTESPROD-DATA1
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
network.host: _non_loopback_
http.enabled: false
discovery.zen.ping.unicast.hosts: ["10.158.36.220","10.158.36.200","10.158.36.201","10.158.36.202","10.158.36.210","10.158.36.211","10.158.36.212"]
threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Why is GC taking so long?

Not really longer than usual. It's just taking time because you apparently have around 30GB of heap, and an old GC on 30GB can take up to 2 minutes, I believe.
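
If you want more detail than the monitor.jvm warnings, you could also enable GC logging on the data nodes. A sketch, assuming your startup script honours ES_JAVA_OPTS; the log path is only an example:

```
# Hypothetical example: write detailed GC logs so long pauses are easy to correlate with the disconnects.
export ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/elasticsearch/gc.log"
bin/elasticsearch -d
```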

What are the common causes of this issue?

BTW, you did not tell us your versions (Java and ES).

Apparently you changed some default settings, such as threadpool.bulk.queue_size: 1000, so I assume you understood the consequences of doing that.
I'm not sure why you adjusted those knobs, but look at what your monitoring says. Check the bulk queue usage: is it full, or almost full? A large queue can hold a lot of memory that then has to be reclaimed.
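
For example, you can watch the bulk queue and rejections per node with the cat API; a sketch, assuming HTTP access to the cluster on port 9200:

```
# Per-node bulk/index/search thread pools: active threads, queue depth, rejections.
curl -s 'http://localhost:9200/_cat/thread_pool?v'
```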

Maybe you are using fielddata for sorting or aggregations?
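
If you are not sure, the cat API can show that too; again just a sketch, assuming HTTP access on port 9200:

```
# Fielddata memory usage per node, broken down by field.
curl -s 'http://localhost:9200/_cat/fielddata?v'
```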

Did you install Marvel/X-Pack to monitor your cluster? That might tell you something.

I have formatted the code now.

Not really longer than usual. It's just taking time because you apparently have around 30GB of heap, and an old GC on 30GB can take up to 2 minutes, I believe.

So, can I increase the ping timeout to 1 minute to avoid the node disconnections? Would that be a good option?
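
Something like this in elasticsearch.yml is what I have in mind (just a sketch based on my reading of the zen fault detection settings; I have not tested it):

```
# Untested sketch: make fault detection more tolerant of long GC pauses.
discovery.zen.fd.ping_timeout: 60s   # default is 30s, matching the "maximum [30s] timeout" in our logs
discovery.zen.fd.ping_retries: 3     # default
```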

BTW, you did not tell us your versions (Java and ES).

Java 1.8.0_111 and ES 2.4.0

Apparently you changed some default settings, such as threadpool.bulk.queue_size: 1000, so I assume you understood the consequences of doing that.
I'm not sure why you adjusted those knobs, but look at what your monitoring says. Check the bulk queue usage: is it full, or almost full? A large queue can hold a lot of memory that then has to be reclaimed.

Agreed, but our bulk requests were getting rejected, which is why we increased it. Currently I can see a queue of about 200 on one node; all the other nodes are in single digits.

We are not using any fielddata: Field Size, Field Evictions, and Cache are continuously 0.

We installed HQ and Head for monitoring, as we cannot afford paid plugins right now. Please suggest any free, open-source plugins we could use.

I just realised that you also opened:

Let's keep the discussion there then. No need to open a new thread on the same problem IMO.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.