GC - Node freezes and all operations fail from client

Hey All,

Currently, I am running ES 1.7.1 with an 8GB heap on a 16-CPU-core machine. I am running filter/aggregation operations on demand. The node was fine for a few days, but then I suddenly started seeing GC logs:

[monitor.jvm ] [vManage Node] [gc][young][1342325][31487] duration [1s], collections [1]/[1.3s], total [1s]/[8.6m], memory [7.2gb]->[7.1gb]/[9.5gb], all_pools {[young] [480.8mb]->[8.5mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [6.6gb]->[7gb]/[8.6gb]}

and all operations from the Java client started failing:

 org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [Node][inet[/]][indices:data/write/index] disconnected


[transport] (elasticsearch[Blob][generic][T#22]) [Blob] failed to get local cluster state for [Node][][localhost][inet[/]], disconnecting...: org.elasticsearch.transport.ReceiveTimeoutTransportException: [Node][inet[/]][cluster:monitor/state] request_id [308108] timed out after [15001ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529) [elasticsearch-1.7.1.jar:]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_25]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_25]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_25]

Why does the server become unresponsive when it runs GC? GC is supposed to run in the background without disconnecting the client.

Can anyone help me with this issue?

GC is a stop-the-world process, but it shouldn't really be causing major issues when the pauses are that short.
Is the node overloaded?

This happened while I was running bulk insert operations; each bulk was less than 5MB or 5,000 records.
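The bulk sizing described above (capping each request at 5,000 records) can be sketched as a generic batching helper. This is an illustrative sketch, not the actual client code; the `partition` helper and the record type are my own stand-ins:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    /** Split records into batches of at most maxPerBatch items each. */
    public static <T> List<List<T>> partition(List<T> records, int maxPerBatch) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += maxPerBatch) {
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(
                records.subList(i, Math.min(i + maxPerBatch, records.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 12000; i++) records.add(i);
        // 12,000 records at a 5,000-record cap -> batches of 5000, 5000, 2000.
        List<List<Integer>> batches = partition(records, 5000);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 2000
    }
}
```

Each batch would then be sent as one bulk request; keeping batches this size bounds how much request data the node has to buffer at once.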

[2016-05-05 15:10:08,126][INFO ][cluster.metadata         ] [Node] [index1] update_mapping [index1-2016-05-05T15] (dynamic)
[2016-05-05 15:23:54,781][WARN ][monitor.jvm              ] [Node] [gc][young][1342325][31487] duration [1s], collections [1]/[1.3s], total [1s]/[8.6m], memory [7.2gb]->[7.1gb]/[9.5gb], all_pools {[young] [480.8mb]->[8.5mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [6.6gb]->[7gb]/[8.6gb]}
[2016-05-05 15:24:02,141][WARN ][monitor.jvm              ] [Node] [gc][young][1342331][31492] duration [1.4s], collections [1]/[1.9s], total [1.4s]/[8.6m], memory [6.7gb]->[6.3gb]/[9.5gb], all_pools {[young]     [552.9mb]->[2.8mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [6gb]->[6.2gb]/[8.6gb]}
[2016-05-05 15:44:26,114][WARN ][monitor.jvm              ] [Node] [gc][old][1342993][38] duration [39.3s], collections [1]/[40.1s], total [39.3s]/[7.9m], memory [9.5gb]->[8.9gb]/[9.5gb], all_pools {[young] [865.3mb]->[350.4mb]/[865.3mb]}{[survivor] [64.6mb]->[0b]/[108.1mb]}{[old] [8.6gb]->[8.6gb]/[8.6gb]}
[2016-05-05 15:45:05,751][WARN ][monitor.jvm              ] [Node] [gc][old][1342996][39] duration [36.8s], collections [1]/[37s], total [36.8s]/[8.5m], memory [9.5gb]->[7.8gb]/[9.5gb], all_pools {[young] [865.3mb]->[5.3mb]/[865.3mb]}{[survivor] [95.9mb]->[0b]/[108.1mb]}{[old] [8.6gb]->[7.8gb]/[8.6gb]}
[2016-05-05 16:00:01,030][INFO ][cluster.metadata         ] [Node] [index1] update_mapping [index1-2016-05-05T16] (dynamic)

It took approximately 1 hour for the GC to finish. During this time, the client was totally unusable. How can we handle this kind of scenario gracefully?
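One way to degrade more gracefully on the client side is to retry transient failures (such as `NoNodeAvailableException` during a long GC pause) with exponential backoff instead of failing outright. This is a generic sketch under my own assumptions; the helper name and timings are illustrative, not part of the ES client API:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    /**
     * Run op, retrying up to maxRetries times on failure,
     * doubling the delay between attempts (exponential backoff).
     */
    public static <T> T call(Callable<T> op, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) throw e; // give up after maxRetries
                Thread.sleep(delay);                // wait before retrying
                delay *= 2;                         // back off exponentially
            }
        }
    }
}
```

Wrapping each bulk request in something like this won't make the node respond during a stop-the-world pause, but it keeps the client from treating a temporary stall as a permanent failure.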

Your node is running out of memory, and the JVM is having to do full GCs to try to free up space. Restart it. The reasons it could be running out of memory range from simply having too much data for the available memory to a memory leak in ES, among other possibilities. There are a number of threads in this group with solutions to issues like yours.

The general solution is to build a cluster with 3 nodes so that the loss of any single node won't take down the cluster.
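For a 3-node ES 1.x cluster, the usual companion setting is `discovery.zen.minimum_master_nodes: 2` (a quorum of three master-eligible nodes), so the cluster keeps working when one node drops out without risking split-brain. A sketch of the relevant `elasticsearch.yml` lines, with placeholder names and hosts:

```yaml
# elasticsearch.yml (ES 1.x) -- illustrative values only
cluster.name: my-cluster              # placeholder cluster name
node.name: node-1                     # unique per node
# quorum of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
# list all three nodes for unicast discovery (placeholder hosts)
discovery.zen.ping.unicast.hosts: ["host1", "host2", "host3"]
```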