Elasticsearch node unresponsive, high active_opens, CLOSE_WAIT


#1

Received an alert saying that one of my nodes was down. When I tried to curl /, it just hung.

Checked on the health of the cluster and the node and noted something very strange:

        "network": {
            "tcp": {
                "active_opens": 102188,
                "passive_opens": 7133683,
                "curr_estab": 205,
                "in_segs": 1483621255,
                "out_segs": 2405602124,
                "retrans_segs": 569006,
                "estab_resets": 9251,
                "attempt_fails": 3252,
                "in_errs": 11,
                "out_rsts": 23640
            }
        },
# sudo netstat -tupn |grep CLOSE_WAIT |  wc -l
11711
# sudo netstat -tupn  |  wc -l
11940

So nearly all of the connections on this node are stuck in CLOSE_WAIT.
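For anyone who wants the full breakdown by state rather than a single grep, here is a minimal Python sketch that tallies TCP states from saved `netstat -tupn` output. The function name `count_tcp_states` and the sample addresses are made up for illustration, and it assumes the usual Linux net-tools column layout where the state is the sixth field on `tcp`/`tcp6` lines.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connection states from `netstat -tupn` text.

    Assumes the Linux net-tools layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State [PID/Program]
    Header lines and udp lines (which have no state) are skipped.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Illustrative sample only; the addresses and PIDs are invented.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        1      0 10.0.0.1:9200           10.0.0.2:51234          CLOSE_WAIT  1234/java
tcp        1      0 10.0.0.1:9200           10.0.0.2:51240          CLOSE_WAIT  1234/java
tcp6       0      0 10.0.0.1:9300           10.0.0.3:40112          ESTABLISHED 1234/java
udp        0      0 10.0.0.1:123            10.0.0.9:123                        -
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

Note that a plain `netstat -tupn | wc -l` also counts the two header lines, so the real connection total is slightly lower than the raw line count. A socket sits in CLOSE_WAIT when the remote side has closed the connection but the local process has not yet closed its end.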

The active opens were about 10x higher on this node than on any other one. What are active opens vs. passive opens, and what is the expected number of each? And how can I make Elasticsearch close these connections more aggressively?
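To make the "10x more than any other node" comparison repeatable, something like the following rough sketch could work. It assumes the nodes stats response has the shape `nodes -> <node id> -> network -> tcp`, matching the fragment pasted above. The helper names, the `factor` threshold, and every sample value except 102188 are made up for illustration.

```python
from statistics import median


def active_opens_by_node(nodes_stats):
    """Map node name to tcp.active_opens from a parsed nodes stats response.

    Assumed shape (not verified against every version):
    {"nodes": {"<id>": {"name": ..., "network": {"tcp": {...}}}}}
    """
    result = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        tcp = node.get("network", {}).get("tcp", {})
        if "active_opens" in tcp:
            result[node.get("name", node_id)] = tcp["active_opens"]
    return result


def outliers(counts, factor=5.0):
    """Return node names whose count exceeds `factor` times the median."""
    if not counts:
        return []
    mid = median(counts.values())
    return sorted(name for name, value in counts.items()
                  if mid > 0 and value > factor * mid)


# Illustrative data: only 102188 comes from the stats above.
SAMPLE_STATS = {
    "nodes": {
        "id1": {"name": "node-a", "network": {"tcp": {"active_opens": 102188}}},
        "id2": {"name": "node-b", "network": {"tcp": {"active_opens": 10100}}},
        "id3": {"name": "node-c", "network": {"tcp": {"active_opens": 9800}}},
    }
}

if __name__ == "__main__":
    counts = active_opens_by_node(SAMPLE_STATS)
    print(outliers(counts))
```

Comparing against the median rather than the mean keeps a single runaway node from dragging the baseline up with it.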

I'm running 1.7.1 on Java 1.8

{
  "status" : 200,
  "name" : <redacted>,
  "cluster_name" : <redacted>,
  "version" : {
    "number" : "1.7.1",
    "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp" : "2015-07-29T09:54:16Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

http://elasticsearch-users.115913.n3.nabble.com/Increasing-CLOSE-WAIT-connections-and-HTTP-current-open-metric-td4019752.html

seems to be related, but it doesn't seem to have a conclusive solution or an explanation of what is happening.


(Jason Wee) #2

I ran into a situation similar to yours yesterday. Today, when I checked the system monitoring history again, I saw a high number of passive_opens just before the node became unresponsive. Passive opens usually hover around 2,500,000, and because I had written a multithreaded script with 2 threads, they went up to approximately 3,500,000 just before the node became unresponsive and detached from the cluster.

I also checked other metrics, and these were higher than usual: CPU usage on %user, index rate, translog operations, merge requests, CMS GC activity, and JVM direct pool memory usage.

From these observations, my guess is that the node was just too busy with GC, merging, and indexing, and it timed out. I can see several ping request timeouts in the log too.

hth

