Aggregation query takes Elasticsearch data nodes offline due to long GC

We are running a 5-node cluster on version 2.3.5: one client node and four nodes configured as both data and master nodes. All nodes have a 32GB heap. Below are the JAVA OPTS the process is configured with.

-Xms32g -Xmx32g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true

We are now running the search query below against index XYZ (97GB), and it causes ES nodes to go offline because of long GC runs.

{
  "from": 0,
  "size": 0,
  "query": {
    "bool": {
      "must": {
        "bool": {
          "must": [
            {"match": {"id": {"query": "9999", "type": "phrase"}}},
            {"match": {"cltp": {"query": "encounters", "type": "phrase"}}}
          ]
        }
      }
    }
  },
  "_source": {"includes": ["COUNT"], "excludes": []},
  "aggregations": {
    "eid": {
      "terms": {"field": "eid", "size": 200},
      "aggregations": {
        "COUNT(DISTINCT eid)": {
          "cardinality": {"field": "eid", "precision_threshold": 40000}
        }
      }
    }
  }
}
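
For a rough sense of how much heap the nested cardinality aggregation in this query could need, the sketch below applies the rule of thumb from the Elasticsearch documentation that a `precision_threshold` of c costs about c * 8 bytes per counter. The function name and the bucket counts are hypothetical: the number of distinct `eid` values per shard is not stated here, and the sketch assumes one counter per terms bucket, so treat the result as an upper-bound estimate rather than a measurement.

```python
def cardinality_memory_bytes(precision_threshold, buckets):
    """Estimate heap used by cardinality counters nested under a terms aggregation.

    Assumes the documented rule of thumb of about precision_threshold * 8 bytes
    per counter, and one counter per terms bucket (an assumption of this sketch).
    """
    return precision_threshold * 8 * buckets


if __name__ == "__main__":
    threshold = 40000  # value used in the query above
    # Hypothetical bucket counts; the real count depends on how many
    # distinct eid values each shard has to track.
    for buckets in (1, 200, 100_000):
        estimate = cardinality_memory_bytes(threshold, buckets)
        print(f"{buckets:>7} buckets -> {estimate / 1024**2:,.1f} MiB")
```

Under these assumptions each bucket costs about 320 KB, so the total grows linearly with the number of buckets a shard has to hold while it runs the aggregation.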

Below is the shard distribution of the index:
DN1: Replica 0,2,3
DN2: Primary 0,1,4
DN3: Primary 2,3
DN4: Replica 1,4

During the query, nodes DN1 and DN4 fall out of the cluster due to long GC cycles. The logs are below.
DN1:
[2017-11-10 08:27:25,898][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1076][3] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[16.9s], memory [31.5gb]->[31.8gb]/[31.8gb], all_pools {[young] [614.5mb]->[865.3mb]/[865.3mb]}{[survivor] [0b]->[107.4mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:28:17,615][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1077][18] duration [1.2m], collections [15]/[1.2m], total [1.2m]/[1.5m], memory [31.8gb]->[31.8gb]/[31.8gb], all_pools {[young] [865.3mb]->[865.3mb]/[865.3mb]}{[survivor] [107.4mb]->[108mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:28:49,877][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1080][26] duration [12.1s], collections [2]/[12.1s], total [12.1s]/[2.1m], memory [31.8gb]->[31.8gb]/[31.8gb], all_pools {[young] [865.3mb]->[865.3mb]/[865.3mb]}{[survivor] [108mb]->[108mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}

DN4:
[2017-11-10 08:26:36,566][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1074][32] duration [2.3s], collections [1]/[2.4s], total [2.3s]/[11.2s], memory [19.6gb]->[20.4gb]/[31.8gb], all_pools {[young] [17.5mb]->[8.7mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [19.5gb]->[20.3gb]/[30.9gb]}
[2017-11-10 08:26:38,016][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1075][33] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[12.5s], memory [20.4gb]->[21.3gb]/[31.8gb], all_pools {[young] [8.7mb]->[26mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [20.3gb]->[21.2gb]/[30.9gb]}
[2017-11-10 08:26:39,693][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1076][34] duration [1.5s], collections [1]/[1.6s], total [1.5s]/[14.1s], memory [21.3gb]->[22.1gb]/[31.8gb], all_pools {[young] [26mb]->[28.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [21.2gb]->[22gb]/[30.9gb]}
[2017-11-10 08:26:41,217][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1077][35] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[15.6s], memory [22.1gb]->[23gb]/[31.8gb], all_pools {[young] [28.4mb]->[27.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [22gb]->[22.8gb]/[30.9gb]}
[2017-11-10 08:26:42,990][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1078][36] duration [1.6s], collections [1]/[1.7s], total [1.6s]/[17.2s], memory [23gb]->[23.8gb]/[31.8gb], all_pools {[young] [27.4mb]->[17.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [22.8gb]->[23.7gb]/[30.9gb]}
[2017-11-10 08:26:45,167][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1079][37] duration [2s], collections [1]/[2.1s], total [2s]/[19.3s], memory [23.8gb]->[24.7gb]/[31.8gb], all_pools {[young] [17.4mb]->[34.6mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [23.7gb]->[24.5gb]/[30.9gb]}
[2017-11-10 08:26:47,178][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1080][38] duration [1.9s], collections [1]/[2s], total [1.9s]/[21.2s], memory [24.7gb]->[25.5gb]/[31.8gb], all_pools {[young] [34.6mb]->[35.2mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [24.5gb]->[25.4gb]/[30.9gb]}
[2017-11-10 08:26:48,714][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1081][39] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[22.7s], memory [25.5gb]->[26.3gb]/[31.8gb], all_pools {[young] [35.2mb]->[25.9mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [25.4gb]->[26.2gb]/[30.9gb]}
[2017-11-10 08:26:50,547][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1082][40] duration [1.7s], collections [1]/[1.8s], total [1.7s]/[24.4s], memory [26.3gb]->[27.2gb]/[31.8gb], all_pools {[young] [25.9mb]->[27.5mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [26.2gb]->[27gb]/[30.9gb]}
[2017-11-10 08:26:52,066][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1083][41] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[25.9s], memory [27.2gb]->[28gb]/[31.8gb], all_pools {[young] [27.5mb]->[26.1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [27gb]->[27.9gb]/[30.9gb]}
[2017-11-10 08:26:53,067][INFO ][monitor.jvm ] [10.0.1.95] [gc][young][1084][42] duration [879ms], collections [1]/[1s], total [879ms]/[26.7s], memory [28gb]->[29.3gb]/[31.8gb], all_pools {[young] [26.1mb]->[472.2mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [27.9gb]->[28.7gb]/[30.9gb]}
[2017-11-10 08:27:04,654][WARN ][monitor.jvm ] [10.0.1.95] [gc][old][1086][2] duration [10.3s], collections [1]/[10.4s], total [10.3s]/[10.4s], memory [30.5gb]->[31.3gb]/[31.8gb], all_pools {[young] [25.9mb]->[450.2mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [30.4gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:27:12,423][INFO ][monitor.jvm ] [10.0.1.95] [gc][old][1087][3] duration [7.7s], collections [1]/[7.7s], total [7.7s]/[18.1s], memory [31.3gb]->[31.8gb]/[31.8gb], all_pools {[young] [450.2mb]->[865.3mb]/[865.3mb]}{[survivor] [0b]->[105.6mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
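
To see how quickly the old generation fills up in logs like these, the `monitor.jvm` lines can be parsed with a short script. This is a minimal sketch that assumes the log format shown above; the function names and the regular expression are illustrative and not part of Elasticsearch.

```python
import re
from datetime import datetime

# Captures the timestamp, the collector (young/old) and the old-pool sizes
# before and after the collection, e.g. {[old] [19.5gb]->[20.3gb]/[30.9gb]}
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d\- :,]+)\]"
    r".*?\[gc\]\[(?P<gen>young|old)\]"
    r".*?\{\[old\] \[(?P<before>[\d.]+)(?P<bu>b|kb|mb|gb)\]"
    r"->\[(?P<after>[\d.]+)(?P<au>b|kb|mb|gb)\]/"
)

UNIT = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_gc_line(line):
    """Return a dict for one monitor.jvm GC line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "collector": m.group("gen"),
        "old_before": float(m.group("before")) * UNIT[m.group("bu")],
        "old_after": float(m.group("after")) * UNIT[m.group("au")],
    }


def old_gen_growth_gb_per_s(lines):
    """Average old-generation growth between the first and last parsed line."""
    samples = [s for s in map(parse_gc_line, lines) if s is not None]
    if len(samples) < 2:
        return None
    first, last = samples[0], samples[-1]
    elapsed = (last["time"] - first["time"]).total_seconds()
    if elapsed <= 0:
        return None
    return (last["old_after"] - first["old_after"]) / elapsed / UNIT["gb"]
```

Fed the DN4 young-collection lines from 08:26:36 to 08:26:52, this reports the old generation growing from 20.3gb to 27.9gb in 15.5 seconds, roughly 0.49 GB per second, which matches the picture of the heap filling within about a minute of the query starting.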

Question: how can we stop or minimise these stop-the-world GC pauses?
