Hello,
I have an ELK stack that has been up and running for a few years now. Until recently the cluster was made up entirely of VMware VMs: 3 master/data nodes and 2 data-only nodes for older log data. We use a hot/cold architecture with the nodes labeled live/archive. Recently I added a physical host running 2 instances of ES (one master + live data, one archive data) and removed one of the master VMs.
After adding the physical node I started having problems with one of the existing master-eligible VMs during the initial shard rebalance. In the logs I was seeing messages saying that the master node had timed out sending monitoring requests to the problem node, followed by messages saying a response had been received after the timeout. This was causing issues for Logstash, and since I needed the cluster back I shut down the VM that was reporting the issues.
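In case it's relevant to those timeouts, these are the zen fault-detection settings I've been considering tuning (the values below are just the 5.x defaults as I understand them from the docs, not something I've changed yet):

```yaml
# elasticsearch.yml - zen fault detection (ES 5.x defaults, per the docs)
discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other
discovery.zen.fd.ping_timeout: 30s   # how long to wait for a ping response
discovery.zen.fd.ping_retries: 3     # failed pings before a node is considered down
```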
Everything worked again until I tried to move data from the hot nodes to the cold nodes. During the process I started seeing the same errors in the master node logs, this time about one of my data-only cold nodes. I tested ping between the master and the cold node and noticed packet loss. When pinging out from the problem node I saw "ping: sendmsg: no buffer space available" and also had trouble keeping an SSH session open. I tried several settings to increase network buffers but never found one that resolved the issue. I also tried rebooting the node and increasing the JVM heap, but the problems continued until I shut the node down. Shortly after shutting down this node I began having the same problems with the other cold VM and had to bring it down as well.
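For reference, these are the kinds of buffer-related sysctls I experimented with (example values only; none of them made the "no buffer space available" errors go away):

```
# /etc/sysctl.conf fragment - values I tried, not recommendations
net.core.rmem_max = 26214400        # max socket receive buffer size
net.core.wmem_max = 26214400        # max socket send buffer size
net.core.netdev_max_backlog = 5000  # packets queued before the kernel starts dropping
```

I applied these with `sysctl -p` and verified them with `sysctl net.core.rmem_max`, but the errors persisted.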
At this point I am running 2 hot masters (one VM and one physical) and 1 physical data-only cold node. I know I need to get a 3rd master-eligible node into the mix, but I would like to understand what is going on before I make any more changes to the cluster. Can anyone provide insight into why I am seeing these connection issues and how I might go about resolving them? Is a mixed (physical/virtual) cluster supported, and are there any special considerations needed to prevent these issues?
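For what it's worth, once the 3rd master-eligible node is back in, my understanding is that the quorum setting for ES 5.6 should look like this (a sketch of my elasticsearch.yml, assuming 3 master-eligible nodes):

```yaml
# elasticsearch.yml - quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

Please correct me if that's wrong for this topology.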
I understand this post may be confusing, so please let me know if I can clarify anything or provide more information. Thank you in advance!
Jonathan Spino
Current Cluster Info:
```
[madmin@mbatlgl-es1 ~]$ curl -XGET 'http://localhost:9200/_nodes/process?pretty'
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "mbatlgl",
  "nodes" : {
    "-KKM0U22RZeavf9U8CP4ug" : {
      "name" : "mbatlgl-ark0",
      "transport_address" : "192.168.9.29:9301",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "data",
        "ingest"
      ],
      "attributes" : {
        "box_type" : "archive"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 30404,
        "mlockall" : false
      }
    },
    "VBqOu8fZQ72epEZzcn5QpQ" : {
      "name" : "mbatlgl-es0",
      "transport_address" : "192.168.9.29:9300",
      "host" : "192.168.9.29",
      "ip" : "192.168.9.29",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 32234,
        "mlockall" : false
      }
    },
    "JvKPMHt3R5a7j-TH0UMeTw" : {
      "name" : "mbatlgl-es1",
      "transport_address" : "192.168.9.27:9300",
      "host" : "192.168.9.27",
      "ip" : "192.168.9.27",
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
      "attributes" : {
        "box_type" : "live"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 2854,
        "mlockall" : false
      }
    }
  }
}
```