Http/Transport module is unresponsive during high activity

Xorandor · February 17, 2016, 10:40am

The general symptom I'm experiencing is that, at seemingly random points in time, Elasticsearch becomes unresponsive: neither the HTTP nor the Transport module will accept connections, and any attempt to connect to either of them times out.
I can reproduce it when logged on to the machine itself, making requests to the REST interface without going through the network, i.e.
curl http://localhost:9200/

I started out running a one-machine cluster with ES 2.0 and experienced this issue during high load.
Since then I've scaled out to a 3-machine cluster with 1 search node (node.master: false and node.data: false) and upgraded to ES 2.1.1, and I'm still experiencing the exact same issues.
Except now that it's a cluster, nodes constantly leave and rejoin the cluster during high load, since the problem affects the Transport module as well and causes node pings to time out.

I'm running all nodes on Windows Server 2012R2.

I haven't found any way to investigate the problem. Whenever it happens, nothing is written to the logs. Sometimes the node rejoins after 5-15 minutes; other times it never recovers and I have to kill the process manually, since it won't even respond to stopping the service.

I've taken an excerpt from the log of one of the nodes around the time it occurred; the 15-minute 'silence' is the period during which the node was unresponsive.

[2016-02-17 11:11:56,968][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35884][3275] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[24.9m], memory [7gb]->[5.2gb]/[7.7gb], all_pools {[young] [2gb]->[73.7mb]/[2.1gb]}{[survivor] [253.2mb]->[273mb]/[273mb]}{[old] [4.7gb]->[4.8gb]/[5.3gb]}
[2016-02-17 11:13:26,811][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35886][3276] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[26.3m], memory [7.1gb]->[5.4gb]/[7.7gb], all_pools {[young] [1.9gb]->[7.9mb]/[2.1gb]}{[survivor] [273mb]->[214.5mb]/[273mb]}{[old] [4.8gb]->[5.2gb]/[5.3gb]}
[2016-02-17 11:28:15,568][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][old][35899][239] duration [14.5m], collections [1]/[14.5m], total [14.5m]/[1.1h], memory [7.3gb]->[2.2gb]/[7.7gb], all_pools {[young] [1.9gb]->[10.8mb]/[2.1gb]}{[survivor] [214.5mb]->[0b]/[273mb]}{[old] [5.2gb]->[2.2gb]/[5.3gb]}
[2016-02-17 11:28:16,490][INFO ][discovery.zen ] [elasticsearch02.schultzprod.local] master_left [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}], reason [failed to ping, tried [6] times, each with maximum [1m] timeout]
[2016-02-17 11:28:16,506][WARN ][discovery.zen ] [elasticsearch02.schultzprod.local] master left (reason = failed to ping, tried [6] times, each with maximum [1m] timeout), current nodes: {{elasticsearch04.schultzprod.local}{I63ADLtRRuqfSZvnujjydg}{10.76.173.53}{10.76.173.53:9300}{data=false, master=false},{elasticsearch02.schultzprod.local}{eBRzl0mKSyaigQHMplkKXw}{10.76.173.51}{10.76.173.51:9300},{elasticsearch03.schultzprod.local}{1jp-TRWuR8OfTLYqoSW-9Q}{10.76.173.52}{10.76.173.52:9300},}
[2016-02-17 11:28:16,662][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] removed {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-master_failed ({elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300})
[2016-02-17 11:28:28,626][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] detected_master {elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}, added {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-receive(from master [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}])
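
The monitor.jvm warnings in the excerpt above each report a garbage-collection duration in brackets (for example `duration [14.5m]` on the `[gc][old]` line). As a rough aid for scanning a larger log file for such entries, here is a minimal Python sketch that extracts those durations and lists the long ones. The regular expression is inferred only from the lines shown here, and the function name `long_gc_pauses`, the 10-second threshold, and the unit table are illustrative assumptions, not anything Elasticsearch itself provides.

```python
import re

# Pattern inferred from the monitor.jvm lines in the excerpt above (assumption:
# other Elasticsearch versions may format these lines differently).
GC_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[\w+\s*\]\[monitor\.jvm\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[gc\]\[(?P<gen>\w+)\]\[\d+\]\[\d+\]\s*"
    r"duration \[(?P<value>[\d.]+)(?P<unit>ms|s|m|h)\]"
)

# Assumed unit suffixes; only "m" actually appears in the excerpt.
SECONDS_PER_UNIT = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def long_gc_pauses(lines, threshold_s=10.0):
    """Return (timestamp, generation, seconds) for each GC warning whose
    reported duration is at least threshold_s seconds."""
    pauses = []
    for line in lines:
        match = GC_LINE.match(line)
        if match is None:
            continue  # not a monitor.jvm GC line
        seconds = float(match["value"]) * SECONDS_PER_UNIT[match["unit"]]
        if seconds >= threshold_s:
            pauses.append((match["ts"], match["gen"], seconds))
    return pauses
```

It could be run over a log file with something like `long_gc_pauses(open(path))`, where `path` is wherever the node writes its log; the timestamps it returns can then be compared against the periods in which the node stopped answering.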

Xorandor · February 21, 2016, 4:22pm

After getting more resources allocated to the machines (RAM upped from 16GB to 32GB), these instability issues don't happen quite as often, although they still occur.

Our cluster is used for storing logs from source systems and is very much index-heavy.

What I'm struggling with most is how to investigate the issues and find the root cause of the problems I'm running into, in cases like this where nothing obvious is being logged.
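
One way to at least pin down when a node stops answering is to repeat the local `curl http://localhost:9200/` check on a timer and record each result with a timestamp. The following is a minimal Python sketch of that idea using only the standard library; the function names `probe` and `watch`, the 5-second timeout, and the 10-second interval are hypothetical choices for illustration, not a recommended monitoring setup.

```python
import http.client
import time
import urllib.request


def probe(url, timeout_s=5.0):
    """Request url once; return (ok, elapsed_seconds).

    ok is False on timeouts, refused connections, HTTP error statuses,
    and malformed responses.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


def watch(url, interval_s=10.0, rounds=3):
    """Probe url `rounds` times, printing one timestamped line per attempt."""
    for _ in range(rounds):
        ok, elapsed = probe(url)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {'ok' if ok else 'FAILED'} {elapsed:.2f}s")
        time.sleep(interval_s)
```

Calling `watch("http://localhost:9200/", rounds=1000)` on the node itself would give a simple timeline of failed probes to line up against the node's own log timestamps.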
