One node is stuck in GC

Hi

I am using Elasticsearch 2.4. The setup has 3 nodes, each with 36 CPUs and a 32GB ES heap plus 32GB left for the Lucene/filesystem cache.

The setup ran fine for 7 days under heavy, continuous traffic. After 7 days, one node started performing lots of young-gen GCs, each taking a long time (around 40s). Logs and hot threads are linked below. Any leads would help a lot.

CPU utilisation on that node is very high, and I am wondering what it is doing compared to the other nodes. Initially we had the replication factor set to 2; after seeing a lot of delay caused by ES, we reduced it to 1.

https://drive.google.com/drive/folders/10DVHNOq1yDxUUL3eas6W9PyYE-HC9jIF?usp=sharing

One of the nodes seems to be using swap, which could cause performance problems. It is recommended to disable swap on Elasticsearch nodes.
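A quick way to check and address this (a minimal sketch, assuming the cluster is reachable on localhost:9200; bootstrap.mlockall is the 2.x name of the memory-lock setting):

# Check whether each node has the heap locked in memory (reported as "mlockall")
curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall

# If swap cannot be removed entirely, lock the heap instead:
# add to elasticsearch.yml on each node and restart
#   bootstrap.mlockall: true

# Or disable swap at the OS level on the Elasticsearch hosts
sudo swapoff -a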

Hi

Swap has been disabled on all machines from the beginning. I have attached the configs, logs and hot threads; the merge thread seems to be very busy. We use spinning disks, not SSDs, and the merge thread count is set to 1.
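For reference, this is roughly how that merge setting can be verified and applied (a sketch: localhost:9200 and the template name metrics_merge are assumptions, and metrics-2019.02.13 is just one of the daily indices used as an example):

# Show the per-index settings, including index.merge.scheduler.max_thread_count
curl -s 'localhost:9200/metrics-2019.02.13/_settings?pretty'

# On spinning disks the usual recommendation is max_thread_count: 1,
# applied via an index template so newly created daily indices pick it up
curl -XPUT 'localhost:9200/_template/metrics_merge' -d '{
  "template": "metrics-*",
  "settings": {
    "index.merge.scheduler.max_thread_count": 1
  }
}'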

Why do we then see this in the node stats for node xCgbhMTmThK28AZG1nx7AA?

"os": {
  "timestamp": 1548077656890,
  "cpu_percent": 28,
  "load_average": 11.84,
  "mem": {
    "total_in_bytes": 126568382464,
    "free_in_bytes": 3714973696,
    "used_in_bytes": 122853408768,
    "free_percent": 3,
    "used_percent": 97
  },
  "swap": {
    "total_in_bytes": 8589930496,
    "free_in_bytes": 5121765376,
    "used_in_bytes": 3468165120
  }
}
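(That is the "os" block from the nodes stats API; it can be pulled for that single node like this, assuming the cluster is reachable on localhost:9200.)

curl -s 'localhost:9200/_nodes/xCgbhMTmThK28AZG1nx7AA/stats/os?pretty'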

Each machine also runs Kubernetes containers, so the swap usage comes from other processes. Node xCgbhMTmThK28AZG1nx7AA doesn't have any issue and is running properly. The issue is with node XNo15jgrSQupkQYFkpwU1Q.

Hi, I restarted the Elasticsearch node that was having the issue, and the issue does not seem to be appearing now. Can you give me some guidelines or ideas on what I should suspect or look into?

I see that the issue repeats after roughly 18 hours, and only on that particular node. Any help in figuring out what the cause could be would be appreciated.

Lots of GC is happening on only one node, and it is always the same node. All nodes are configured identically, so I am wondering what the cause could be.

[2019-01-26 11:49:09,963][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][152972][3298] duration [11.6s], collections [1]/[12.5s], total [11.6s]/[14m], memory [17.8gb]->[6.9gb]/[29.4gb], all_pools {[young] [12.2gb]->[40.2mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [4gb]->[5.3gb]/[15.5gb]}

[2019-01-26 11:49:46,400][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153003][3299] duration [5.4s], collections [1]/[6.3s], total [5.4s]/[14.1m], memory [18.2gb]->[8.3gb]/[29.4gb], all_pools {[young] [11.3gb]->[52.6mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [5.3gb]->[6.7gb]/[15.5gb]}

[2019-01-26 11:50:17,880][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153034][3300] duration [1s], collections [1]/[1.4s], total [1s]/[14.1m], memory [20.4gb]->[8.4gb]/[29.4gb], all_pools {[young] [12.1gb]->[29.1mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [6.7gb]->[6.8gb]/[15.5gb]}

[2019-01-26 11:50:44,650][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153058][3301] duration [2.7s], collections [1]/[3.7s], total [2.7s]/[14.1m], memory [20.4gb]->[9.5gb]/[29.4gb], all_pools {[young] [11.9gb]->[17.9mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [6.8gb]->[7.9gb]/[15.5gb]}

[2019-01-26 11:52:11,776][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153130][3302] duration [3.5s], collections [1]/[3.8s], total [3.5s]/[14.2m], memory [19.2gb]->[7.4gb]/[29.4gb], all_pools {[young] [12.3gb]->[6.5mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [5.3gb]->[5.9gb]/[15.5gb]}

[2019-01-26 11:54:05,202][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153202][3303] duration [22.8s], collections [1]/[23.2s], total [22.8s]/[14.6m], memory [17.3gb]->[6.6gb]/[29.4gb], all_pools {[young] [12.2gb]->[94.9mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [3.5gb]->[5gb]/[15.5gb]}

[2019-01-26 11:57:03,648][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153294][3305] duration [43.7s], collections [2]/[44.2s], total [43.7s]/[15.3m], memory [18.9gb]->[8.7gb]/[29.4gb], all_pools {[young] [12.3gb]->[271.3mb]/[12.4gb]}{[survivor] [1.5gb]->[378.6mb]/[1.5gb]}{[old] [5gb]->[8.1gb]/[15.5gb]}

[2019-01-26 12:00:03,996][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153406][3306] duration [14.4s], collections [1]/[14.9s], total [14.4s]/[15.5m], memory [20.4gb]->[9.7gb]/[29.4gb], all_pools {[young] [12.3gb]->[24.4mb]/[12.4gb]}{[survivor] [378.6mb]->[1.5gb]/[1.5gb]}{[old] [7.7gb]->[8.1gb]/[15.5gb]}
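For reference, the same heap pool and GC counters shown in these log lines can be polled from the nodes stats API (a minimal sketch; the node name is taken from the log lines above and localhost:9200 is assumed):

# Heap pool sizes and GC counts/times for the problem node, can be run periodically
curl -s 'localhost:9200/_nodes/metrics-datastore-2/stats/jvm?pretty'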

Hi

Can anyone please help me with this? I am continuously hitting this issue and am blocked. Any guidance on how to proceed would help.

Is the amount of data shown below too much for 3 ES data nodes, each with a 31GB heap plus 32GB left for the Lucene/filesystem cache?

curl localhost:9200/_cat/indices
green open metrics-2019.02.13   3 2   5799316 0  10.1gb   3.3gb 
green open metrics-2019.02.06   3 2   4400282 0   9.2gb     3gb 
green open metrics-2019.02.07   3 2  10657640 0  22.3gb   7.4gb 
green open metrics-2019.02.08   3 2  10652504 0  22.4gb   7.4gb 
green open metrics-2019.02.09   3 2  10652500 0  22.6gb   7.5gb 
green open metrics-2019.02.07-1 3 2  73252166 0  40.2gb  13.4gb 
green open logs-2019.02.13.13   3 2  26727277 0  24.5gb   8.1gb 
green open metrics-2019.02.06-1 3 2  30291630 0  16.4gb   5.4gb 
green open logs-2019.02.13.14   3 2  20068073 0  18.4gb   6.2gb 
green open metrics-2019.02.09-1 3 2  74120995 0  40.8gb  13.5gb 
green open logs-2019.02.13.15   3 2    278152 0 201.3mb 104.3mb 
green open metrics-2019.02.08-1 3 2  74500069 0  40.7gb  13.6gb 
green open metrics-2019.02-10   3 2  64485524 0  78.1gb  26.1gb 
green open metrics-2019.02.10   3 2  10652206 0  22.7gb   7.5gb 
green open logs-2019.02.13.10   3 2  26697205 0  24.5gb   8.1gb 
green open logs-2019.02.13.11   3 2  22676059 0  20.8gb   6.9gb 
green open metrics-2019.02.11   3 2  10525632 0  22.2gb   7.4gb 
green open metrics-2019.02.12   3 2   9606808 0  17.9gb   5.9gb 
green open logs-2019.02.13.12   3 2  22353715 0  22.3gb   6.8gb 
green open metrics-2019.02.10-1 3 2 102968475 0  57.2gb    19gb 
green open metrics-2019.02.12-1 3 2  71503558 0  38.7gb  12.9gb 
green open metrics-2019.02.11-1 3 2  88552753 0  48.9gb  16.2gb 
green open metrics-2019.02.13-1 3 2  67064281 0  34.7gb  11.8gb

We have also set index.codec.bloom.load=false. Is this going to have any impact?
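To put these numbers in per-node terms, the shard and disk distribution can also be compared across the three nodes (a quick sketch, again assuming localhost:9200):

# Disk usage and shard count per node
curl -s 'localhost:9200/_cat/allocation?v'

# Per-shard placement, to check whether the busy node holds more of the hot indices
curl -s 'localhost:9200/_cat/shards?v'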

Hi, we also delete one index every hour. Does Elasticsearch keep any tombstone-like state for deleted indices? If so, what is the safest way to delete an index every hour? Is this something I should worry about?
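(For reference, a delete of one hourly index via the delete index API would look like the following; the index name is just an example taken from the listing above.)

curl -XDELETE 'localhost:9200/logs-2019.02.13.10'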

Elasticsearch 2.4 was released in August 2016, saw about a year of maintenance and bug fixes, and reached the end of its supported life nearly a year ago. I don't think I have a development environment in which I can even build it any more, let alone dig into this kind of memory issue. The best path forward is to upgrade to a more recent, supported, version since these incorporate many improvements in resilience.


We are definitely going to upgrade ES to the latest release, but right now we are at a stage where we cannot upgrade immediately. If you can confirm that these kinds of issues exist in ES 2.4, that would also be helpful to us. Any workaround or guidance would be great.

Hi

Is this issue related to my case?

https://issues.apache.org/jira/browse/LUCENE-7647

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.