Cluster nodes get disconnected and out of sync due to ping timeouts caused by transport load

Hiya,

We have 3 Elasticsearch clusters, all of which suffer from the same strange issue: the cluster fails after our application runs a query.

We have around 1.5 TB to 2.5 TB of data (including replicas) across the 3 clusters; they are all the same in structure, just on different hardware.

The data is stored in around 1,500 indexes, and we had been running without issues for over a year. Recently we have needed to query more of the data in the cluster and are now running complex queries across more indexes than before.

What happens is this: when the query is launched, the cluster distributes the search to all the indexes (well, shards), but after a short time the nodes themselves start getting network timeouts and are removed from the cluster, which puts it into a red state. It gets so bad that I have to fully restart the cluster to be able to use it again.
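
For anyone wanting to watch this happen, polling the cluster health endpoint while the query runs shows the node count drop and the status flip to red:

GET /_cluster/health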

I believe the cause is that there is so much transport activity going on (visible in our network graphs) that the network card becomes saturated to the point where the cluster's state/alive pings actually time out. Both the masters and the data nodes log errors showing that pings timed out, followed not long after by the instance receiving the missing ping and logging that it arrived after it had expired.
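
For reference, I believe these are the zen fault-detection settings that govern those pings; we run the defaults, so the elasticsearch.yml sketch below just spells them out:

# Fault-detection pings between the master and the nodes (ES 2.x defaults).
discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for a reply
discovery.zen.fd.ping_retries: 3     # failed pings tolerated before a node is dropped

Raising these would presumably only mask the saturation rather than fix it.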

To prove this was the issue, I added 2 network cards to all the servers in the cluster and used Linux bonding in round-robin mode to give me a 2Gb link rather than the 1Gb previously used. Doing this provides enough bandwidth for that particular test query to succeed and return data as expected without any ping issues; however, a slightly more complex query then causes the same problem.

Normally, with clusters I have used in the past, I would fix this by having the ping checks go over their own LAN and the transport data over a separate LAN; this allows the cluster to maintain its state checks even when the transport load is heavy.

Is there a way to do this with the current Elasticsearch transport module? From what I have read so far, apparently not, but if anyone has suggestions on how it can be done I would love to know.

I already segregate the HTTP traffic for ES from the transport traffic to avoid timeout issues like this, but I can't see how to split the pings from the standard transport data, if that is even possible.
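
For reference, the HTTP/transport split is just the standard per-module bind settings in elasticsearch.yml; ours look something like this (the addresses below are illustrative, not our real ones):

# REST clients talk to one interface...
http.bind_host: 10.0.1.15
http.publish_host: 10.0.1.15
# ...while node-to-node transport uses another. The fault-detection pings
# travel over this same transport channel, which is why I can't isolate
# them on their own LAN.
transport.bind_host: 10.0.2.15
transport.publish_host: 10.0.2.15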

Forgot to mention: the instances in the cluster are on version 2.3.4.

The issue also occurs on version 2.3.1.

Regards

Lee

Hi @cardy,

Is there a way to do this [separating pings and search requests] with the current elasticsearch transport module

No, this is not possible at the moment.

I believe the cause of this is that there is so much transport activity going on (seen in network graphs) that the network card becomes saturated

Is there any chance that an index recovery or snapshot operation is running concurrently? Which type of query do you run?

Daniel

Hi Daniel,

To my knowledge there were no recovery operations in progress; I waited for the cluster to recover and rebalance before starting each of my tests.

There were definitely no snapshots involved, as we don't currently use them.
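
For anyone wanting to check the same things, these two calls list per-shard recovery activity and any snapshots currently running:

GET /_cat/recovery?v
GET /_snapshot/_status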

The only other activity could have come from our users, but I have tried this both in and out of working hours and get the same outcome each time.

The only thing that has made a difference so far is running with the bonded network cards, which stopped the ping timeouts; nothing else was changed for that test. Unfortunately, adding another layer of aggregation to the query once again broke it with ping timeouts.

The query is below; I have kept the structure but changed some field names for commercial reasons:

POST large_dataset_alias/PROSET1/_search
{
   "size": 0,
   "query": {
      "filtered": {
         "filter": {
            "bool": {
               "must": [
                  {
                     "bool": {
                        "must": [
                           {
                              "range": {
                                 "Month": {
                                    "from": "2013-01-01",
                                    "to": "2015-12-01"
                                 }
                              }
                           },
                           {
                              "terms": {
                                 "searchcode": [
										"F2A",
										"LDR",
										"KJC",
										"N2B",
										"YP2"
                                 ]
                              }
                           }
                        ],
                        "must_not": [
                           {
                              "terms": {
                                 "sex": [
                                    "999999",
                                    "9",
                                    "0"
                                 ]
                              }
                           }
                        ]
                     }
                  }
               ],
               "must_not": [
                  {
                     "bool": {
                        "must": [
                           {
                              "term": {
                                 "FILTERFLAG": 1
                              }
                           }
                        ]
                     }
                  }
               ],
               "should": []
            }
         }
      }
   },
   "aggs": {
      "sex": {
         "terms": {
            "field": "sex",
            "size": 0
         },
         "aggs": {
            "agebands": {
               "range": {
                  "field": "age",
                  "ranges": [
                     {
                        "key": "0-19",
                        "from": 0,
                        "to": 20
                     },
                     {
                        "key": "20-39",
                        "from": 20,
                        "to": 40
                     },
                     {
                        "key": "40-59",
                        "from": 40,
                        "to": 60
                     },
                     {
                        "key": "60-79",
                        "from": 60,
                        "to": 80
                     },
                     {
                        "key": "80+",
                        "from": 80
                     }
                  ]
               },
               "aggs": {
                  "measurement2": {
                     "stats": {
                        "field": "measurement2"
                     }
                  }
               }
            }
         }
      }
   }
}

The full terms list contains 209 values, but I had to trim it so I could post a reply.

Kind Regards

Lee

Hi Lee,

I'll just summarize the detour to GitHub issue 19646:

There are two problems:

  • Long GC pauses (up to almost 2 minutes)
  • High traffic volume on the transport layer that saturates the network.

The GC issues should be addressed by tuning the garbage collector. This is not hugely Elasticsearch-specific, but we have a few tips on GC topics in the Definitive Guide.
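
A quick way to check whether the GC pauses line up with the node drops is the JVM section of the nodes stats API:

GET /_nodes/stats/jvm

Comparing the collection times there with the timestamps of the ping timeout log entries should confirm the correlation.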

As Boaz suggested on GitHub, you can try reducing the number of shards to mitigate the second issue.
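
Reducing shards means reindexing into indices created with a lower shard count; a minimal sketch (the index name and counts are placeholders):

PUT /large_dataset_2016
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
   }
}

With around 1,500 indexes, every shard a query fans out to contributes its own slice of transport traffic, so fewer shards means fewer per-shard responses crossing the wire at once.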

Daniel
