Issues with stability in mixed (Physical/Virtual) Environment

Hello,

I have an ELK stack that has been up and running for a few years now. Until recently the cluster consisted entirely of VMware VMs: 3 master/data nodes and 2 data-only nodes for older log data. We use the hot/cold architecture, with the nodes labeled live/archive. Recently I added a physical host running 2 instances of ES (one master/live-data and one archive-data) and removed one of the master VMs.

After adding the physical node I started having problems with one of the existing master-eligible VMs during the initial shard rebalance. In the logs I saw messages saying that the master node timed out when sending monitoring requests to the affected node, and also messages saying that a response had been received after the request had already timed out. This was causing issues for Logstash and I needed the cluster back, so I shut down the VM that was reporting issues.

Everything worked again until I tried to move data from hot to cold nodes. During the process I started seeing the same errors in the master node logs, this time about one of my data-only cold nodes. I tested ping between the master and the cold node and noticed packet loss. When pinging out from the affected node I saw messages saying "ping: sendmsg: no buffer space available" and also had issues with the SSH session. I tried multiple settings to increase buffers but never found one that resolved the issue. I tried rebooting the node and also increasing the heap size, but I kept having issues until I shut the node down. Shortly after shutting down this node I began having the same problems with the other cold VM, and I had to bring it down as well.
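
For reference, here is a minimal sketch of the kind of request involved in the hot-to-cold move, assuming it is done with shard allocation filtering on the `box_type` node attribute that appears in the node output further down. The index name is hypothetical and the helper name is only for illustration.

```python
import json


def relocation_settings(box_type: str) -> dict:
    """Build an index settings body that requires the index's shards to be
    allocated on nodes whose `box_type` attribute equals the given value."""
    return {"index.routing.allocation.require.box_type": box_type}


# Hypothetical index name; the body would be sent as
# PUT /<index>/_settings to any node's HTTP port.
index_name = "example-index-2018.05"
body = relocation_settings("archive")

print(f"PUT /{index_name}/_settings")
print(json.dumps(body, indent=2))
```

Once a setting like this is applied, the cluster relocates the index's shards to the archive nodes, which is the point at which the timeouts described above started to appear.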

At this point I am running 2 hot masters (one VM and one physical) and 1 physical data-only cold node. I know I need to get a 3rd master into the mix, but I would like to find out what is going on before I make any more changes to the cluster. Can anyone provide insight into why I am seeing connection issues and how I might go about resolving them? Is a mixed cluster (physical/virtual) supported, and are there any special considerations that need to be taken to prevent these issues?
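
To spell out why the 3rd master matters: with the Zen discovery used in 5.x, the usual recommendation is to set `discovery.zen.minimum_master_nodes` to a majority of the master-eligible nodes. A small sketch of that arithmetic follows; the function names are illustrative and not part of Elasticsearch, and the value actually configured on this cluster is not shown in this post.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum for discovery.zen.minimum_master_nodes: (N / 2) + 1,
    using integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible: int) -> int:
    """Number of master-eligible nodes that can be lost while a quorum remains."""
    return master_eligible - minimum_master_nodes(master_eligible)


for count in (2, 3):
    print(
        f"{count} master-eligible: quorum={minimum_master_nodes(count)}, "
        f"tolerated failures={tolerated_master_failures(count)}"
    )
```

With 2 master-eligible nodes the quorum is 2, so losing either one leaves the cluster unable to elect a master; with 3 the quorum is still 2 and one node can be lost.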

I understand this post may be confusing, so please let me know if I can clarify anything or provide any more information. Thank you in advance!

Jonathan Spino

Current Cluster Info:
[madmin@mbatlgl-es1 ~]$ curl -XGET 'http://localhost:9200/_nodes/process?pretty'
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "mbatlgl",
  "nodes" : {
    "-KKM0U22RZeavf9U8CP4ug" : {
      "name" : "mbatlgl-ark0",
      "transport_address" : "192.168.9.29:9301",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "archive"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 30404,
        "mlockall" : false
      }
    },
    "VBqOu8fZQ72epEZzcn5QpQ" : {
      "name" : "mbatlgl-es0",
      "transport_address" : "192.168.9.29:9300",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master", "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 32234,
        "mlockall" : false
      }
    },
    "JvKPMHt3R5a7j-TH0UMeTw" : {
      "name" : "mbatlgl-es1",
      "transport_address" : "192.168.9.27:9300",
      "host" : "192.168.9.27",
      "ip" : "192.168.9.27",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master", "data", "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 2854,
        "mlockall" : false
      }
    }
  }
}
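
In case it helps when reading the output above, this is a rough sketch of how the same response could be reduced to the fields that matter here (name, host, box_type, master eligibility). The embedded JSON is a trimmed copy of the response with only those fields kept, and `summarize_nodes` is just an illustrative helper name.

```python
import json

# Trimmed copy of the _nodes/process response: only the fields used below.
NODES_RESPONSE = """
{
  "nodes": {
    "-KKM0U22RZeavf9U8CP4ug": {
      "name": "mbatlgl-ark0", "host": "192.168.9.29",
      "roles": ["data", "ingest"], "attributes": {"box_type": "archive"}
    },
    "VBqOu8fZQ72epEZzcn5QpQ": {
      "name": "mbatlgl-es0", "host": "192.168.9.29",
      "roles": ["master", "data", "ingest"], "attributes": {"box_type": "live"}
    },
    "JvKPMHt3R5a7j-TH0UMeTw": {
      "name": "mbatlgl-es1", "host": "192.168.9.27",
      "roles": ["master", "data", "ingest"], "attributes": {"box_type": "live"}
    }
  }
}
"""


def summarize_nodes(response: dict) -> list:
    """Return one row per node, sorted by node name."""
    rows = []
    for node in response["nodes"].values():
        rows.append({
            "name": node["name"],
            "host": node["host"],
            "box_type": node.get("attributes", {}).get("box_type"),
            "master_eligible": "master" in node["roles"],
        })
    return sorted(rows, key=lambda row: row["name"])


rows = summarize_nodes(json.loads(NODES_RESPONSE))
master_eligible = [row["name"] for row in rows if row["master_eligible"]]

for row in rows:
    print(row)
print("master-eligible nodes:", master_eligible)
```

The summary makes the current layout easier to see: two master-eligible live nodes, with mbatlgl-es0 and mbatlgl-ark0 sharing the physical host at 192.168.9.29.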

Below are the relevant logs from the master node:

[2018-06-08T00:04:38,453][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:06:10,004][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:06:12,105][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63828312] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:06:20,170][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:06:30,170][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
[2018-06-08T00:06:50,006][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:07:00,170][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:07:08,260][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [71156ms] ago, timed out [56156ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828312]
[2018-06-08T00:07:08,262][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [73590ms] ago, timed out [43589ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828293]
[2018-06-08T00:07:08,262][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [43589ms] ago, timed out [13589ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63828426]
[2018-06-08T00:17:34,411][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:19:51,478][INFO ][o.e.c.m.MetaDataMappingService] [mbatlgl-es1] [mbatlgl_800/Y1vHNKo-R4m_b6hNurvHHg] update_mapping [message]
[2018-06-08T00:23:00,015][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:23:10,208][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:23:20,208][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
[2018-06-08T00:23:35,149][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [51607ms] ago, timed out [21607ms] ago, action [internal:discovery/zen/fd/ping], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63845141]
[2018-06-08T00:30:40,020][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:30:50,052][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:30:53,976][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63852703] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:30:55,473][WARN ][o.e.t.TransportService ] [mbatlgl-es1] Received response for a request that has timed out, sent [16497ms] ago, timed out [1497ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63852703]
[2018-06-08T00:35:40,025][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [mbatlgl-es1] collector [index-recovery] timed out when collecting data
[2018-06-08T00:35:48,724][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [mbatlgl-es1] failed to execute on node [HXRQVEkTRreOk1Wrq_QNXg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [mbatlgl-ark1][192.168.9.31:9300][cluster:monitor/nodes/stats[n]] request_id [63857637] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
[2018-06-08T00:35:50,057][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mbatlgl-es1] collector [index-stats] timed out when collecting data
[2018-06-08T00:36:00,057][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [mbatlgl-es1] collector [cluster_stats] timed out when collecting data
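
To get a quick picture of how late the responses in the log above are, something like the following sketch could pull the delays out of the "Received response for a request that has timed out" lines. The regular expression is written against the 5.6.8 message format shown here, and the two sample lines are copied from the log; the function name is only illustrative.

```python
import re
from collections import defaultdict

# Matches the TransportService warning for a response that arrived after
# its request had already timed out.
LATE_RESPONSE = re.compile(
    r"Received response for a request that has timed out, "
    r"sent \[(?P<sent_ms>\d+)ms\] ago, timed out \[(?P<late_ms>\d+)ms\] ago, "
    r"action \[(?P<action>.+?)\], node \[\{(?P<node>[^}]+)\}"
)

SAMPLE_LINES = [
    "[2018-06-08T00:23:35,149][WARN ][o.e.t.TransportService ] [mbatlgl-es1] "
    "Received response for a request that has timed out, sent [51607ms] ago, "
    "timed out [21607ms] ago, action [internal:discovery/zen/fd/ping], node "
    "[{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}"
    "{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63845141]",
    "[2018-06-08T00:30:55,473][WARN ][o.e.t.TransportService ] [mbatlgl-es1] "
    "Received response for a request that has timed out, sent [16497ms] ago, "
    "timed out [1497ms] ago, action [cluster:monitor/nodes/stats[n]], node "
    "[{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}"
    "{192.168.9.31}{192.168.9.31:9300}{box_type=archive}], id [63852703]",
]


def summarize_late_responses(lines):
    """Group round-trip times (ms since the request was sent) by (node, action)."""
    summary = defaultdict(list)
    for line in lines:
        match = LATE_RESPONSE.search(line)
        if match:
            key = (match.group("node"), match.group("action"))
            summary[key].append(int(match.group("sent_ms")))
    return dict(summary)


result = summarize_late_responses(SAMPLE_LINES)
for (node, action), delays in sorted(result.items()):
    print(f"{node} {action}: worst round trip {max(delays)} ms")
```

Run against the full master log, this would show whether the slow responses all come from the same node and whether both monitoring requests and fault-detection pings are affected.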

These are the logs from the problem node. At the same time, I receive "ping: sendmsg: no buffer space available" when trying to ping out from the host, and I have issues with SSH sessions to it:

java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,740][INFO ][o.e.d.z.ZenDiscovery ] [mbatlgl-ark1] master_left [{mbatlgl-es1}{JvKPMHt3R5a7j-TH0UMeTw}{VNJvjUDgQw-AsAcuP5Llew}{192.168.9.27}{192.168.9.27:9300}{box_type=live}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-06-08T00:37:20,741][WARN ][o.e.d.z.ZenDiscovery ] [mbatlgl-ark1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{mbatlgl-es0}{VBqOu8fZQ72epEZzcn5QpQ}{jAjFvqcsTLW8mblOY0CYUw}{192.168.9.29}{192.168.9.29:9300}{box_type=live}
{mbatlgl-ark1}{HXRQVEkTRreOk1Wrq_QNXg}{R2vu4ckiSlWRxE24crdhgA}{192.168.9.31}{192.168.9.31:9300}{box_type=archive}, local
{mbatlgl-es1}{JvKPMHt3R5a7j-TH0UMeTw}{VNJvjUDgQw-AsAcuP5Llew}{192.168.9.27}{192.168.9.27:9300}{box_type=live}, master
{mbatlgl-ark0}{-KKM0U22RZeavf9U8CP4ug}{YOjhVcQ4STizkuIL4u45ZA}{192.168.9.29}{192.168.9.29:9301}{box_type=archive}
[2018-06-08T00:37:20,740][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xb5b46d9b, L:/192.168.9.31:9300 ! R:/192.168.9.29:55508])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,740][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xfd58442b, L:/192.168.9.31:9300 ! R:/192.168.9.29:55498])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,764][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,766][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,766][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x6374fb2a, L:0.0.0.0/0.0.0.0:9300 ! R:/192.168.9.27:50496])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,781][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xc2ffdce6, L:/192.168.9.31:9300 ! R:/192.168.9.27:50488])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,787][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xc2ffdce6, L:/192.168.9.31:9300 ! R:/192.168.9.27:50488])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,862][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0x2d5de1d5, L:/192.168.9.31:9300 ! R:/192.168.9.27:50494])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-06-08T00:37:20,863][WARN ][o.e.t.n.Netty4Transport ] [mbatlgl-ark1] write and flush on the network layer failed (channel: [id: 0xb15297b5, L:/192.168.9.31:9300 ! R:/192.168.9.27:50498])
java.nio.channels.ClosedChannelException: null
