Http/Transport module is unresponsive during high activity


(Allan Hansen) #1

The general symptom I'm experiencing is that, at seemingly random points in time, Elasticsearch becomes unresponsive: neither the Http nor the Transport module will accept connections, and any attempt to connect to either of them times out.
I can reproduce it when logged onto the machine itself, making requests to the REST interface without going through the network, for example:
curl http://localhost:9200/
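
To pin down exactly when the node stops answering, the same local request can be repeated on a timer with an explicit timeout. The following is only a minimal sketch: the names probe, watch and ProbeResult are made up for illustration, and the 5 second timeout and 10 second interval are arbitrary assumptions rather than recommended values.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional

# Ignore any system proxy settings so the request goes straight to the local node.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class ProbeResult(NamedTuple):
    ok: bool        # True if the node sent back any HTTP response
    seconds: float  # how long the attempt took
    detail: str     # HTTP status or the exception that was raised


def probe(url: str = "http://localhost:9200/", timeout: float = 5.0) -> ProbeResult:
    """Issue one GET request and report whether the node answered in time."""
    start = time.monotonic()
    try:
        with _DIRECT.open(url, timeout=timeout) as resp:
            resp.read()
            ok, detail = True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # An error status still means the HTTP module is alive and answering.
        ok, detail = True, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers timeouts, refused connections and malformed responses.
        ok, detail = False, repr(exc)
    return ProbeResult(ok, time.monotonic() - start, detail)


def watch(url: str, interval: float = 10.0, rounds: Optional[int] = None) -> None:
    """Print one timestamped line per probe; runs forever if rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        result = probe(url)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} ok={result.ok} {result.seconds:.2f}s {result.detail}")
        time.sleep(interval)
        done += 1
```

Calling watch("http://localhost:9200/") on the affected machine and keeping the output would give timestamps that can be lined up against the Elasticsearch log afterwards.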

I started out running a one-machine cluster with ES 2.0 and experienced this issue during high load.
Since then I've scaled out to a 3-machine cluster with 1 search node (node.master: false and node.data: false) and upgraded to ES 2.1.1, and I'm still experiencing exactly the same issues.
Except now that it's a cluster, nodes constantly leave and rejoin the cluster during high load, since the problem affects the Transport module as well and causes node pings to time out.

I'm running all nodes on Windows Server 2012R2.

I haven't found any way to investigate the problem. Whenever it happens, nothing is written to the logs. Sometimes the node rejoins after 5-15 minutes; other times it never recovers and I have to kill the process manually, because it won't even respond to stopping the service.

I've taken an excerpt from the log of one of the nodes around the time it occurred; the 15-minute 'silence' is the period during which it was unresponsive.

[2016-02-17 11:11:56,968][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35884][3275] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[24.9m], memory [7gb]->[5.2gb]/[7.7gb], all_pools {[young] [2gb]->[73.7mb]/[2.1gb]}{[survivor] [253.2mb]->[273mb]/[273mb]}{[old] [4.7gb]->[4.8gb]/[5.3gb]}
[2016-02-17 11:13:26,811][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35886][3276] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[26.3m], memory [7.1gb]->[5.4gb]/[7.7gb], all_pools {[young] [1.9gb]->[7.9mb]/[2.1gb]}{[survivor] [273mb]->[214.5mb]/[273mb]}{[old] [4.8gb]->[5.2gb]/[5.3gb]}
[2016-02-17 11:28:15,568][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][old][35899][239] duration [14.5m], collections [1]/[14.5m], total [14.5m]/[1.1h], memory [7.3gb]->[2.2gb]/[7.7gb], all_pools {[young] [1.9gb]->[10.8mb]/[2.1gb]}{[survivor] [214.5mb]->[0b]/[273mb]}{[old] [5.2gb]->[2.2gb]/[5.3gb]}
[2016-02-17 11:28:16,490][INFO ][discovery.zen ] [elasticsearch02.schultzprod.local] master_left [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}], reason [failed to ping, tried [6] times, each with maximum [1m] timeout]
[2016-02-17 11:28:16,506][WARN ][discovery.zen ] [elasticsearch02.schultzprod.local] master left (reason = failed to ping, tried [6] times, each with maximum [1m] timeout), current nodes: {{elasticsearch04.schultzprod.local}{I63ADLtRRuqfSZvnujjydg}{10.76.173.53}{10.76.173.53:9300}{data=false, master=false},{elasticsearch02.schultzprod.local}{eBRzl0mKSyaigQHMplkKXw}{10.76.173.51}{10.76.173.51:9300},{elasticsearch03.schultzprod.local}{1jp-TRWuR8OfTLYqoSW-9Q}{10.76.173.52}{10.76.173.52:9300},}
[2016-02-17 11:28:16,662][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] removed {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-master_failed ({elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300})
[2016-02-17 11:28:28,626][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] detected_master {elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}, added {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-receive(from master [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}])
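
The excerpt does contain monitor.jvm warnings that report a collection duration, so one way to start investigating is to scan the log for such lines and list the long ones. This is a rough sketch that assumes every monitor.jvm line has the same layout as the three above; the function name long_gc_pauses and the 10 second default threshold are illustrative choices. Whether these pauses are the actual cause of the unresponsiveness is not established here.

```python
import re
from typing import Iterable, List, NamedTuple

# Seconds per unit suffix as it appears in the duration field.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}

# Assumed layout, taken from the excerpt above:
# [timestamp][LEVEL][monitor.jvm ] [node] [gc][generation][n][n] duration [1.3m], ...
_GC_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[\w+\s*\]"
    r"\[monitor\.jvm\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[gc\]\[(?P<gen>\w+)\]\[\d+\]\[\d+\]\s*"
    r"duration \[(?P<value>[\d.]+)(?P<unit>ms|s|m|h)\]"
)


class GcPause(NamedTuple):
    timestamp: str
    node: str
    generation: str
    seconds: float


def long_gc_pauses(lines: Iterable[str], threshold_s: float = 10.0) -> List[GcPause]:
    """Return the GC warnings whose reported duration is at least threshold_s."""
    pauses = []
    for line in lines:
        match = _GC_LINE.match(line.strip())
        if match is None:
            continue  # not a monitor.jvm GC line
        seconds = float(match["value"]) * _UNIT_SECONDS[match["unit"]]
        if seconds >= threshold_s:
            pauses.append(GcPause(match["ts"], match["node"], match["gen"], seconds))
    return pauses


# Lines from the excerpt above, cut short after the collections field.
SAMPLE = [
    "[2016-02-17 11:11:56,968][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] "
    "[gc][young][35884][3275] duration [1.3m], collections [1]/[1.3m]",
    "[2016-02-17 11:13:26,811][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] "
    "[gc][young][35886][3276] duration [1.4m], collections [1]/[1.4m]",
    "[2016-02-17 11:28:15,568][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] "
    "[gc][old][35899][239] duration [14.5m], collections [1]/[14.5m]",
    "[2016-02-17 11:28:16,490][INFO ][discovery.zen ] [elasticsearch02.schultzprod.local] "
    "master_left",
]

if __name__ == "__main__":
    for pause in long_gc_pauses(SAMPLE):
        print(f"{pause.timestamp} {pause.node} {pause.generation} {pause.seconds:.0f}s")
```

In practice the function would be fed the full log file, for example by passing an open file object instead of SAMPLE.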


(Allan Hansen) #2

After getting more resources allocated to the machines (RAM upped from 16GB to 32GB), these instability issues don't happen quite as often, although they still occur.

Our cluster is used for storing logs from source systems and is very much index-heavy.

What I'm struggling with most is how to investigate the issues and find the root cause of the problems I'm running into in cases like this, where nothing obvious is being logged.

