I made a thread on Reddit and was advised to post here as well: https://www.reddit.com/r/elasticsearch/comments/8dvkx0/elasticsearch_issue_i_cant_figure_out_cant_curl/
We have a really annoying issue that is affecting our production cluster, so any guidance would really be appreciated.
Problem: I have a 3 node ES cluster. Randomly, an Elasticsearch non-master node suddenly stops working. The service itself is running fine, but curl (only on port 9200) simply times out:
```
curl localhost:9200 --verbose
* About to connect() to localhost port 9200 (#0)
* Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
```
This happens intermittently at first, for roughly 10 minutes, before curl starts timing out on port 9200 every time. It can happen to one non-master node, or to both non-master nodes (at different times), and what is more alarming is that search stops responding even though the cluster state remains green.
Only one of the Elasticsearch non-master nodes is suddenly or intermittently unable to ping the master, which is confusing: if it were the network, why can one node ping the master but not the other? There's nothing significant in the master node's logs.
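In case it's relevant, next time this happens I can check whether the transport port (9300) between the affected node and the master is still reachable in both directions. A minimal sketch (the 10.0.0.x addresses are just placeholders for our nodes' private IPs):

```
# From the affected non-master node: is the master's transport port reachable?
nc -zv -w 3 10.0.0.5 9300

# And from the master towards the affected node (both HTTP and transport):
nc -zv -w 3 10.0.0.6 9200
nc -zv -w 3 10.0.0.6 9300
```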
When I check netstat during that ~10 minute window in which curl is intermittently timing out, the Recv-Q number slowly builds up until it sits at 129 permanently:
```
netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address   State    PID/Program name
tcp6  129    0      :::9200         :::*              LISTEN   22554/java
tcp6  0      0      :::9300         :::*              LISTEN   22554/java
```
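My guess (and it is only a guess) is that for a LISTEN socket the Recv-Q column is the accept queue, and 129 means the listen backlog is full because Elasticsearch has stopped accepting new connections on 9200 (128 is a common default backlog, which would line up with it sticking at 129). A quick way I could confirm the configured backlog next time:

```
# For listening sockets, ss shows the current accept queue in Recv-Q
# and the configured backlog limit in Send-Q
ss -ltn 'sport = :9200'

# Kernel-level cap on the accept queue backlog
sysctl net.core.somaxconn
```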
I'm not sure why the above happens. I have the following settings configured:
```
net.ipv4.tcp_max_syn_backlog = 512
net.ipv4.tcp_syncookies = 1
```
But there's nothing in /var/log/messages about a SYN flood.
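One thing I haven't tried yet, but which might show whether the backlog is actually overflowing, is the kernel's own listen queue counters from netstat -s:

```
# Cumulative counters; if these keep climbing while 9200 is unresponsive,
# connections are being dropped because the accept queue is full
netstat -s | grep -i -E 'listen|syn'

# Lines of interest (wording varies by kernel):
#   "times the listen queue of a socket overflowed"
#   "SYNs to LISTEN sockets dropped"
```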
To temporarily get the node working again I have to kill the Elasticsearch process (a systemctl restart just hangs), and then I can start it up again. The curl issue never affects the master node, and I can curl other processes on the problem node just fine (i.e. curl localhost:80 works, but localhost:9200 does not).
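Next time, before killing the process, I'm planning to grab some JVM-level diagnostics in case the HTTP worker threads are stuck. A rough sketch, assuming the PID from the netstat output above and that the master's HTTP is still healthy so I can hit the nodes API from there:

```
# On the broken node: full JVM thread dump (the HTTP layer is dead, but the JVM isn't)
jstack 22554 > /tmp/es-threads-$(date +%s).txt

# From the master (whose port 9200 still works): hot threads across all nodes,
# to see what the unresponsive node is busy doing
curl -s 'localhost:9200/_nodes/hot_threads?threads=10'
```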
The nginx reverse proxy in front of Elasticsearch simply returns a 502 or 504. All system metrics look fine in terms of usage, and the same goes for ES: heap size, networking, everything looks normal right up until the drop-off.
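If it helps anyone reason about this, I can also pull per-node HTTP and thread pool stats from the master while the problem is happening; a minimal sketch:

```
# Run from the master node, which still answers on 9200:
#   http section        -> open/total HTTP connections per node
#   thread_pool section -> queued and rejected tasks per pool (e.g. search, write)
curl -s 'localhost:9200/_nodes/stats/http,thread_pool?pretty'
```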
Our cluster is currently running on Virtual Machines in Azure.
Has anyone ever come across this problem before, or has any guidance on things I can check, or even double check in case I've missed something?