Es instance unusable

lizhu · September 7, 2020, 11:43am

Hi

I have an es(version 6.8.0) cluster with 5 nodes which have 261 shards that max size is 40G and min size is 700M，vm nodes：16 cpu and 32 G.

It became unusable sometimes and print such logs

how to trouble shoot?
thanks all

warkolm · September 7, 2020, 9:00pm

Please don't post pictures of text, they are difficult to read, impossible to search and replicate (if it's code), and some people may not be even able to see them

Christian_Dahlqvist · September 8, 2020, 4:09am

It seems like you have a problem with heap pressure and that garbage collection is taking a very long time, which is causing problems. Can you please provide the full output of the cluster stats API? Can you also check if the VMs used are deployed with over-provisioned CPU or use memory balooning as this can cause very slow GC? It is very important for Elasticsearch to always have access to the memory allocated to it (heap and off heap) and that this is not swapped out to disk.

lizhu · September 14, 2020, 12:57pm

Thanks for your tips, the details of log is here

[2020-09-07T18:29:13,059][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400080][46] duration [41s], collections [1]/[41.2s], total [41s]/[5.3m], memory [15.8gb]->[15.6gb]/[15.8gb], all_pools {[young] [865.3mb]->[765.7mb]/[865.3mb]}{[survivor] [73.2mb]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:29:13,059][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400080] overhead, spent [41s] collecting in the last [41.2s]
[2020-09-07T18:29:49,548][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400081][47] duration [35.8s], collections [1]/[36.4s], total [35.8s]/[5.9m], memory [15.6gb]->[15.6gb]/[15.8gb], all_pools {[young] [765.7mb]->[769.7mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:29:49,548][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400081] overhead, spent [35.8s] collecting in the last [36.4s]
[2020-09-07T18:29:50,561][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400082] overhead, spent [661ms] collecting in the last [1s]
[2020-09-07T18:30:32,324][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400083][48] duration [41.4s], collections [1]/[41.7s], total [41.4s]/[6.6m], memory [15.8gb]->[15.7gb]/[15.8gb], all_pools {[young] [865.3mb]->[794.2mb]/[865.3mb]}{[survivor] [45.7mb]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:30:32,324][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400083] overhead, spent [41.4s] collecting in the last [41.7s]
[2020-09-07T18:31:07,546][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400084][49] duration [34.5s], collections [1]/[35.2s], total [34.5s]/[7.1m], memory [15.7gb]->[15.7gb]/[15.8gb], all_pools {[young] [794.2mb]->[831.9mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:31:07,546][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400084] overhead, spent [34.5s] collecting in the last [35.2s]
[2020-09-07T18:31:08,554][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400085] overhead, spent [700ms] collecting in the last [1s]
[2020-09-07T18:31:49,776][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400086][50] duration [40.8s], collections [1]/[41.2s], total [40.8s]/[7.8m], memory [15.8gb]->[15.7gb]/[15.8gb], all_pools {[young] [865.3mb]->[816mb]/[865.3mb]}{[survivor] [50mb]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:31:49,776][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400086] overhead, spent [40.8s] collecting in the last [41.2s]
[2020-09-07T18:32:26,142][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400087][51] duration [35.5s], collections [1]/[36.3s], total [35.5s]/[8.4m], memory [15.7gb]->[15.7gb]/[15.8gb], all_pools {[young] [816mb]->[840.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:32:26,143][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400087] overhead, spent [35.5s] collecting in the last [36.3s]
[2020-09-07T18:32:27,143][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400088] overhead, spent [750ms] collecting in the last [1s]
[2020-09-07T18:33:09,218][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][old][400089][52] duration [41.8s], collections [1]/[42s], total [41.8s]/[9.1m], memory [15.8gb]->[15.7gb]/[15.8gb], all_pools {[young] [865.3mb]->[865.3mb]/[865.3mb]}{[survivor] [62.5mb]->[373.3kb]/[108.1mb]}{[old] [14.9gb]->[14.9gb]/[14.9gb]}
[2020-09-07T18:33:09,218][WARN ][o.e.m.j.JvmGcMonitorService] [cl-en88xc9f-2] [gc][400089] overhead, spent [41.8s] collecting in the last [42s]

lizhu · September 14, 2020, 1:22pm

Thanks all, vms has no over-provisioned CPU and use memory balooning, I will provide output of the cluster stats API with the status occur next time. the status happend irregularly

PuffPuff_Pass · September 14, 2020, 1:43pm

Hello lizhu,
As best practices jvm should not have a higher heap memorey than 50% of the actual ram memory, if thejvm has 32gb of mem allocated that means the server should have 64Gb of ram memory, al though it'snot recommended to have more than 32gb allocated to jvm ( garbage collector bug kicks in when this occurs )

Does the server have swap enabled ? If yes, disable it as elastic is a "in memory db" and swap only slows it down, talking from experience.

As optimisation for heavy load usecases when creating the vm ( if it'snot in the cloud and you have accessto the underlying hypervisor ) it is recommended to hairpin the vcpu's of the server to the same NUMA, that way you do not use cycles and memory to send data from one phisical cpu to the other ( this adds latency )

How it works: you have 18 vcpu's on the VM, 16's from one NUMA0 and 2 from NUMA1, the queuemanager creates task queue for vcpu's using threads to be sent to the vCPUs and processed, but since not everything is on the same NUMA , queue manager needs to make also a buffer queue to wait for operations executed on the other NUMA. Let's say you have 20 tasks to be completed. You send 16 to NUMA0, 2 to NUMA1 and you have in the queue another 2 that will be executed on numa0, but they need to wait extra time until tasks destined to be prpcessed on numa1 are sent to ram then to numa1, processed by cpu on numa1, getback to ram, back to queue to be acknowledged and then the other 2 tasks are executed.

Hope this helps

lizhu · September 14, 2020, 2:39pm

Thanks for your reply, es has disable swap by configure bootstrap.mlockall: true reference configure memory , does es client operations produce some effects on this status?

system · October 12, 2020, 2:39pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
[gc][2661] overhead, spent [263ms] collecting in the last [1s] Elasticsearch	3	257	April 25, 2023
Elasticsearch heap issues Elasticsearch	4	438	July 5, 2017
Long running GC, cluster status RED, only few GB's data Elasticsearch	12	2524	July 5, 2017
Elasticsearch High CPU Usage - GC Not Working Elasticsearch	26	7054	July 5, 2017
GC failing to reduce heap memory usage Elasticsearch	10	765	July 6, 2017

Es instance unusable

Related topics