Issues with stability in mixed (Physical/Virtual) Environment

jcspino · June 14, 2018, 8:27pm

Hello,

I have an ELK stack that has been up and running for a few years now. Until recently the cluster consisted entirely of VMware VMs: 3 master/data nodes and 2 data-only nodes for older log data. We use the hot/cold architecture, with the nodes labeled live/archive. Recently I added a physical host running 2 instances of ES (one master/live-data and one archive-data) and removed one of the master VMs.
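To make the live/archive labeling concrete, here is a minimal sketch of how an index is typically pinned to one tier in this kind of setup. It assumes the tiers are implemented with shard allocation filtering on a `box_type` node attribute (the attribute name and the `live`/`archive` values match the node info further down; the helper names are hypothetical). It only builds the request bodies and does not contact the cluster.

```python
import json


def tier_settings(box_type):
    """Body for PUT /<index>/_settings that requires the index's shards
    to live on nodes whose box_type attribute equals `box_type`."""
    return {"index.routing.allocation.require.box_type": box_type}


def move_to_archive():
    """Settings body that relocates an index from the live tier to archive."""
    return tier_settings("archive")


if __name__ == "__main__":
    # Print the JSON bodies so they can be sent with curl or any HTTP client.
    print(json.dumps(tier_settings("live")))
    print(json.dumps(move_to_archive()))
```

Applying the second body to an index makes Elasticsearch relocate its shards to the archive nodes, which is the step that generates the heavy node-to-node traffic.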

After adding the physical node, I started having problems with one of the existing master-eligible VMs during the initial shard rebalance. The logs showed messages saying that the master node timed out when sending monitoring requests to the bad node, and others saying that a response had been received after the request had already timed out. This was causing issues for Logstash and I needed the cluster back, so I shut down the VM that was reporting issues.

Everything worked again until I tried to move data from hot to cold nodes. During the process I started seeing the same errors in the master node logs, this time about one of my data-only cold nodes. I tested ping between the master and the cold node and noticed packet loss. When pinging out from the problem node I saw messages saying "ping: sendmsg: no buffer space available" and also had issues with the SSH session. I tried multiple settings to increase buffers, but I never found one that resolved the issue. I rebooted the node and also increased the heap size, but I kept having issues until I shut the node down. Shortly after shutting down this node I began having the same problems with the other cold VM, and I had to bring it down as well.
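Since several buffer settings were changed along the way, it helps to record what each node is actually running with. The following is a minimal sketch, assuming Linux hosts with `/proc/sys` available. The list of keys is an assumption about which values are worth comparing between the VMs and the physical host, not a confirmed cause of the "no buffer space available" message.

```python
from pathlib import Path

# Assumed set of kernel parameters to record; extend as needed.
DEFAULT_KEYS = [
    "net/core/rmem_default",
    "net/core/rmem_max",
    "net/core/wmem_default",
    "net/core/wmem_max",
    "net/ipv4/neigh/default/gc_thresh3",
]


def read_sysctls(keys=DEFAULT_KEYS, root="/proc/sys"):
    """Return {dotted.name: value} for each key, or None if unreadable."""
    values = {}
    for key in keys:
        name = key.replace("/", ".")
        try:
            values[name] = (Path(root) / key).read_text().strip()
        except OSError:
            # Missing file or no permission: record the gap instead of failing.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        print(f"{name} = {value}")
```

Running it on a healthy node and on a problem node gives two small listings that can be diffed directly.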

At this point I am running 2 hot masters (one VM and one physical) and 1 physical data-only cold node. I know I need to get a 3rd master into the mix, but I would like to find out what is going on before I make any more changes to the cluster. Can anyone provide insight into why I am seeing connection issues and how I might go about resolving them? Is a mixed cluster (physical/virtual) supported, and are there any special considerations needed to prevent these issues?
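The reason a 3rd master matters can be shown with the usual quorum arithmetic for Zen discovery in 5.x, where `discovery.zen.minimum_master_nodes` is recommended to be (master-eligible nodes / 2) + 1. A minimal sketch (the function names are hypothetical):

```python
def minimum_master_nodes(master_eligible):
    """Recommended quorum: a strict majority of master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    for count in (2, 3):
        print(
            count,
            "master-eligible:",
            "quorum", minimum_master_nodes(count),
            "tolerates", tolerated_master_failures(count), "failure(s)",
        )
```

With 2 master-eligible nodes the quorum is 2, so losing either one leaves the cluster without a master; with 3 the quorum is still 2 and one node can fail.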

I understand this post may be confusing, so please let me know if I can clarify anything or provide any more information. Thank you in advance!

Jonathan Spino

Current Cluster Info:
[madmin@mbatlgl-es1 ~]$ curl -XGET 'http://localhost:9200/_nodes/process?pretty'
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "mbatlgl",
  "nodes" : {
    "-KKM0U22RZeavf9U8CP4ug" : {
      "name" : "mbatlgl-ark0",
      "transport_address" : "192.168.9.29:9301",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "archive"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 30404,
        "mlockall" : false
      }
    },
    "VBqOu8fZQ72epEZzcn5QpQ" : {
      "name" : "mbatlgl-es0",
      "transport_address" : "192.168.9.29:9300",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master", "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 32234,
        "mlockall" : false
      }
    },
    "JvKPMHt3R5a7j-TH0UMeTw" : {
      "name" : "mbatlgl-es1",
      "transport_address" : "192.168.9.27:9300",
      "host" : "192.168.9.27",
      "ip" : "192.168.9.27",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master", "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 2854,
        "mlockall" : false
      }
    }
  }
}
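The node info above is easier to read once it is grouped by host. Here is a minimal sketch that does this for a parsed `_nodes` response. The sample is trimmed to the fields used, and the function names are hypothetical.

```python
from collections import defaultdict

# Trimmed copy of the _nodes/process response, keeping only the fields used.
SAMPLE = {
    "nodes": {
        "-KKM0U22RZeavf9U8CP4ug": {
            "name": "mbatlgl-ark0", "host": "192.168.9.29",
            "roles": ["data", "ingest"], "attributes": {"box_type": "archive"},
        },
        "VBqOu8fZQ72epEZzcn5QpQ": {
            "name": "mbatlgl-es0", "host": "192.168.9.29",
            "roles": ["master", "data", "ingest"], "attributes": {"box_type": "live"},
        },
        "JvKPMHt3R5a7j-TH0UMeTw": {
            "name": "mbatlgl-es1", "host": "192.168.9.27",
            "roles": ["master", "data", "ingest"], "attributes": {"box_type": "live"},
        },
    }
}


def nodes_by_host(response):
    """Group node name, roles and box_type under each host address."""
    grouped = defaultdict(list)
    for node in response["nodes"].values():
        grouped[node["host"]].append({
            "name": node["name"],
            "roles": node["roles"],
            "box_type": node.get("attributes", {}).get("box_type"),
        })
    return dict(grouped)


def master_eligible(response):
    """Sorted names of nodes that carry the master role."""
    return sorted(
        node["name"]
        for node in response["nodes"].values()
        if "master" in node["roles"]
    )


if __name__ == "__main__":
    for host, nodes in nodes_by_host(SAMPLE).items():
        print(host, [n["name"] for n in nodes])
    print("master-eligible:", master_eligible(SAMPLE))
```

Grouped this way, it is clear that 192.168.9.29 hosts two instances and that only two nodes are master-eligible.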

jcspino · June 14, 2018, 8:28pm

Below are the relevant logs from the master node:

[2018-06-08T00:04:38,453][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:06:10,004][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:06:12,105][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63828312] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:06:20,170][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:06:30,170][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
[2018-06-08T00:06:50,006][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:07:00,170][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:07:08,260][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [71156ms] ago, timed out [56156ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828312]
[2018-06-08T00:07:08,262][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [73590ms] ago, timed out [43589ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828293]
[2018-06-08T00:07:08,262][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [43589ms] ago, timed out [13589ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828293]
[2018-06-08T00:17:34,411][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:19:51,478][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:23:00,015][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:23:10,208][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:23:20,208][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
[2018-06-08T00:23:35,149][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [51607ms] ago, timed out [21607ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63845141]
[2018-06-08T00:30:40,020][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:30:50,052][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:30:53,976][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63852703] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:30:55,473][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [16497ms] ago, timed out [1497ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63852703]
[2018-06-08T00:35:40,025][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:35:48,724][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63857637] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:35:50,057][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:36:00,057][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
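
The "Received response for a request that has timed out" warnings carry the most useful numbers in these logs: how long each request actually took and which node and action were involved. A minimal sketch for summarizing them follows. It assumes each log entry sits on a single line, and the function name is hypothetical.

```python
import re
from collections import defaultdict

LATE_RESPONSE = re.compile(
    r"sent \[(?P<sent>\d+)ms\] ago, timed out \[(?P<late>\d+)ms\] ago, "
    r"action \[(?P<action>.+?)\], node \[\{(?P<node>[^}]+)\}"
)

# Two entries copied from the master log above (message part only).
SAMPLE_LINES = [
    "Received response for a request that has timed out, sent [51607ms] ago, "
    "timed out [21607ms] ago, action [internal:discovery/zen/fd/ping], node "
    "[{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}"
    "{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63845141]",
    "Received response for a request that has timed out, sent [16497ms] ago, "
    "timed out [1497ms] ago, action [cluster:monitor/nodes/stats[n]], node "
    "[{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}"
    "{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63852703]",
]


def summarize_late_responses(lines):
    """Count late responses and track the slowest one per (node, action)."""
    stats = defaultdict(lambda: {"count": 0, "max_sent_ms": 0})
    for line in lines:
        match = LATE_RESPONSE.search(line)
        if match is None:
            continue  # not a late-response warning
        entry = stats[(match.group("node"), match.group("action"))]
        entry["count"] += 1
        entry["max_sent_ms"] = max(entry["max_sent_ms"], int(match.group("sent")))
    return dict(stats)


if __name__ == "__main__":
    for (node, action), entry in summarize_late_responses(SAMPLE_LINES).items():
        print(node, action, entry)
```

Fed the full master log, this shows at a glance whether the delays are confined to one node and whether fault-detection pings are as slow as the stats requests.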

jcspino · June 14, 2018, 8:29pm

These are the logs on the problem node. At the same time, I receive "ping: sendmsg: no buffer space available" when trying to ping out, and I have issues with SSH sessions on the host:

java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,740][INFO ][o.e.d.z.ZenDiscovery ] [mbatlgl-ark1] master_left [{mbatlgl-es1}{JvKPMHt3R5a7j-TH0UMeTw}{VNJvjUDgQw-AsAcuP5Llew}{192.168.9.27}{192.168.9.27:9300}{box_type=live}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-06-08T00:37:20,741][WARN ][o.e.d.z.ZenDiscovery ] [mbatlgl-ark1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{mbatlgl-es0}{VBqOu8fZQ72epEZzcn5QpQ}{jAjFvqcsTLW8mblOY0CYUw}{192.168.9.29}{192.168.9.29:9300}{box_type=live}
{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}, local
{mbatlgl-es1}{JvKPMHt3R5a7j-TH0UMeTw}{VNJvjUDgQw-AsAcuP5Llew}{192.168.9.27}{192.168.9.27:9300}{box_type=live}, master
{mbatlgl-ark0}{-KKM0U22RZeavf9U8CP4ug}{YOjhVcQ4STizkuIL4u45ZA}{192.168.9.29}{192.168.9.29:9301}{box_type=archive}
[2018-06-08T00:37:20,740][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xb5b46d9b, L:/192.168.9.31:9300 ! R:/192.168.9.29:55508])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,740][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xfd58442b, L:/192.168.9.31:9300 ! R:/192.168.9.29:55498])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,764][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,766][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,766][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,781][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xc2ffdce6, L:/192.168.9.31:9300 ! R:/192.168.9.27:50488])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,787][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xc2ffdce6, L:/192.168.9.31:9300 ! R:/192.168.9.27:50488])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,862][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x2d5de1d5, L:/192.168.9.31:9300 ! R:/192.168.9.27:50494])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,863][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xb15297b5, L:/192.168.9.31:9300 ! R:/192.168.9.27:50498])
java.nio.channels.ClosedChannelException: null
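
To see which peers the failed writes were going to, the Netty warnings can be tallied by remote address. A minimal sketch, assuming the channel description keeps the `L:<local> ! R:/<remote>` layout shown in the log (the function name is hypothetical):

```python
import re
from collections import Counter

FAILED_WRITE = re.compile(
    r"write and flush on the network layer failed "
    r"\(channel: \[id: (?P<id>0x[0-9a-f]+), L:\S+ ! R:/(?P<ip>[\d.]+):(?P<port>\d+)\]\)"
)

# Three entries copied from the problem-node log above.
SAMPLE_LINES = [
    "[2018-06-08T00:37:20,740][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] "
    "write and flush on the network layer failed "
    "(channel: [id: 0xb5b46d9b, L:/192.168.9.31:9300 ! R:/192.168.9.29:55508])",
    "[2018-06-08T00:37:20,764][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] "
    "write and flush on the network layer failed "
    "(channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])",
    "[2018-06-08T00:37:20,766][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] "
    "write and flush on the network layer failed "
    "(channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])",
]


def failed_writes_by_peer(lines):
    """Count failed-write warnings per remote IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_WRITE.search(line)
        if match is not None:
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    for ip, count in failed_writes_by_peer(SAMPLE_LINES).most_common():
        print(ip, count)
```

If the failures are spread across every peer rather than concentrated on one, that points at the local node's network stack rather than at a single remote host.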

system · July 12, 2018, 8:29pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
