The general symptom I'm experiencing is that at seemingly random points in time, Elasticsearch becomes unresponsive: neither the HTTP nor the Transport module will accept connections, and trying to connect to either of them results in a timeout.
I can reproduce it when logged onto the machine itself, making requests to the REST interface without going over the network at all, e.g.
curl http://localhost:9200/
I started out running a one-machine cluster on ES 2.0 and experienced this issue under high load.
Since then I've scaled out to a 3-machine cluster with 1 search node (node.master: false and node.data: false) and upgraded to ES 2.1.1, and I'm still experiencing exactly the same issue.
Except now that it's a cluster, nodes constantly leave and rejoin it during high load, since the problem affects the Transport module as well and causes node pings to time out.
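When this starts I can watch the membership churn from one of the nodes that is still responding, using the standard cluster APIs (localhost here is just an example; any reachable node works):

curl "http://localhost:9200/_cluster/health?pretty"
curl "http://localhost:9200/_cat/nodes?v"

The number_of_nodes value and the _cat/nodes listing drop the affected node and then pick it up again a few minutes later.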
I'm running all nodes on Windows Server 2012 R2.
I haven't found any way to investigate the problem: whenever it happens, nothing is written to the logs. Sometimes the node recovers and rejoins after 5-15 minutes; other times it never recovers and I have to kill the process manually - it won't even respond to stopping the service.
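The only extra thing I'm planning to try is grabbing a thread dump the next time it hangs, with something along these lines against the Elasticsearch java process (<pid> is a placeholder), though I suspect it may just block until the JVM becomes responsive again:

jstack -l <pid> > es-threads.txt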
I've taken an excerpt from the log around the time it happened on one of the nodes; the 15-minute 'silence' is the period during which it was unresponsive.
[2016-02-17 11:11:56,968][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35884][3275] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[24.9m], memory [7gb]->[5.2gb]/[7.7gb], all_pools {[young] [2gb]->[73.7mb]/[2.1gb]}{[survivor] [253.2mb]->[273mb]/[273mb]}{[old] [4.7gb]->[4.8gb]/[5.3gb]}
[2016-02-17 11:13:26,811][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][young][35886][3276] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[26.3m], memory [7.1gb]->[5.4gb]/[7.7gb], all_pools {[young] [1.9gb]->[7.9mb]/[2.1gb]}{[survivor] [273mb]->[214.5mb]/[273mb]}{[old] [4.8gb]->[5.2gb]/[5.3gb]}
[2016-02-17 11:28:15,568][WARN ][monitor.jvm ] [elasticsearch02.schultzprod.local] [gc][old][35899][239] duration [14.5m], collections [1]/[14.5m], total [14.5m]/[1.1h], memory [7.3gb]->[2.2gb]/[7.7gb], all_pools {[young] [1.9gb]->[10.8mb]/[2.1gb]}{[survivor] [214.5mb]->[0b]/[273mb]}{[old] [5.2gb]->[2.2gb]/[5.3gb]}
[2016-02-17 11:28:16,490][INFO ][discovery.zen ] [elasticsearch02.schultzprod.local] master_left [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}], reason [failed to ping, tried [6] times, each with maximum [1m] timeout]
[2016-02-17 11:28:16,506][WARN ][discovery.zen ] [elasticsearch02.schultzprod.local] master left (reason = failed to ping, tried [6] times, each with maximum [1m] timeout), current nodes: {{elasticsearch04.schultzprod.local}{I63ADLtRRuqfSZvnujjydg}{10.76.173.53}{10.76.173.53:9300}{data=false, master=false},{elasticsearch02.schultzprod.local}{eBRzl0mKSyaigQHMplkKXw}{10.76.173.51}{10.76.173.51:9300},{elasticsearch03.schultzprod.local}{1jp-TRWuR8OfTLYqoSW-9Q}{10.76.173.52}{10.76.173.52:9300},}
[2016-02-17 11:28:16,662][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] removed {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-master_failed ({elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300})
[2016-02-17 11:28:28,626][INFO ][cluster.service ] [elasticsearch02.schultzprod.local] detected_master {elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}, added {{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300},}, reason: zen-disco-receive(from master [{elasticsearch01.schultzprod.local}{BS-vKOmwRHWAeOzwOLuorw}{10.76.173.50}{10.76.173.50:9300}])
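What stands out to me is the [gc][old] line: a single old-generation collection lasting 14.5 minutes matches the unresponsive window almost exactly, so this looks like a stop-the-world GC pause rather than a network problem. While a node is still reachable, its heap and GC counters can be pulled with something like (localhost again just as an example):

curl "http://localhost:9200/_nodes/stats/jvm?pretty"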