Elasticsearch crashing - Magento 2.4.3

@warkolm we're really stuck here - the GC never stops collecting. It just runs one collection cycle after another and never brings the heap down.

Why is this happening?

[2021-09-24T10:22:45,659][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43032}] took [6056ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:03,076][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [web1.example.com] GC did not bring memory usage down, before [43895040240], after [43963860472], allocations [1], duration [17422]
[2021-09-24T10:23:03,077][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43106}] took [5961ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:14,988][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43130}] took [29336ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:14,988][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43192}] took [17873ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:08,668][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43144}] took [5592ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:20,820][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [web1.example.com] attempting to trigger G1GC due to high heap usage [43947457024]
[2021-09-24T10:23:26,343][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [web1.example.com] GC did not bring memory usage down, before [43947457024], after [43955971096], allocations [0], duration [5523]
[2021-09-24T10:23:26,343][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43192}] took [5832ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:32,199][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43130}] took [11379ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:32,199][INFO ][o.e.m.j.JvmGcMonitorService] [web1.example.com] [gc][old][378][53] duration [46.3s], collections [8]/[40.6s], total [46.3s]/[5m], memory [40.8gb]->[40.9gb]/[41gb], all_pools {[young] [0b]->[0b]/[0b]}{[old] [40.8gb]->[40.9gb]/[41gb]}{[survivor] [0b]->[0b]/[0b]}
[2021-09-24T10:23:44,596][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43464}] took [6382ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:44,596][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43484}] took [6382ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:50,548][WARN ][o.e.m.j.JvmGcMonitorService] [web1.example.com] [gc][378] overhead, spent [46.5s] collecting in the last [40.6s]
[2021-09-24T10:23:50,548][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43432}] took [12396ms] which is above the warn threshold of [5000ms]
[2021-09-24T10:23:56,531][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [web1.example.com] attempting to trigger G1GC due to high heap usage [43976739808]
[2021-09-24T10:24:37,882][WARN ][o.e.h.AbstractHttpServerTransport] [web1.example.com] handling request [null][HEAD][/_alias/example-amasty_product_1][Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43192}] took [71539ms] which is above the warn threshold of [5000ms]

Is there a way to clear out whatever the garbage collector keeps failing to free? Or to stop it from running constantly?

I have even tried:

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "1000gb",
    "cluster.routing.allocation.disk.watermark.high": "600gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "400gb",
    "cluster.info.update.interval": "1m"
  }
}
'

This did not help.
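(Those watermark settings only control disk-based shard allocation, so they wouldn't have any effect on a heap problem; they can be reset later by setting them back to null. For reference, a quick sketch of how to watch the heap and the circuit breakers that are logging those warnings, using the same localhost:9200 endpoint as above:)

# per-node heap usage at a glance
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max"

# current state of the circuit breakers
curl -s "localhost:9200/_nodes/stats/breaker?pretty"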

Hmm, I noticed something:

In htop, the garbage collector thread was showing only 8GB, yet the JVM heap option was set to 31GB (and the process was actually taking up 31GB).

I checked the Elasticsearch config folder and saw that my sysadmin had set the JVM heap option in a separate file:

/etc/Elasticsearch/jvm.options.d/heap_size.options

^^^ this is where the 31GB was set.

But in the main jvm.options file, the heap option was set to 8GB.
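A quick way to spot this kind of conflict - just a sketch, assuming the stock package layout under /etc/elasticsearch:

# list every heap flag across the main file and the override directory
grep -RH "^-Xm[sx]" /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/

As far as I understand, files under jvm.options.d are read after the main jvm.options, so the 31GB value should be the one the JVM actually uses - which would explain why the process was holding 31GB even though the main file said 8GB.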

It is very strange that conflicting options like this would leave the GC threads attached to a different memory setting.

I renamed the /etc/Elasticsearch/jvm.options.d/heap_size.options file, changed the main file to 31GB instead of 8GB, and I'm reindexing right now to see if it helps.

The new GC threads are finally showing 31GB now.
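To double-check what the running node actually picked up - a sketch, with filter_path just trimming the response down to the relevant fields:

curl -s "localhost:9200/_nodes/_local/jvm?pretty&filter_path=nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.using_compressed_ordinary_object_pointers"

heap_max_in_bytes should now come back around 31GB, and using_compressed_ordinary_object_pointers should still report true - presumably the reason for picking 31GB instead of 32GB in the first place.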

I'll report back after the reindex.

It seems to have settled things down, but as soon as any attribute data is changed the heap spikes again and the cluster stops responding. GC is still collecting every 5-10 seconds.

Hi,

Why do you have to keep increasing the RAM for the VM? I will give you one suggestion: build ELK as a cluster (a 3-node cluster), then configure an auto-start check using a shell script - if the services go down, the script should restart/start them.

The shell script checker should check the service every 2 minutes.

If you want it, let me know and I will share the shell script.
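Roughly, such a checker could look like this - a minimal sketch only, assuming systemd and the default elasticsearch service name:

#!/bin/bash
# es_checker.sh - restart Elasticsearch if the service is down or the HTTP port stops answering
if ! systemctl is-active --quiet elasticsearch; then
    systemctl restart elasticsearch
elif ! curl -s --max-time 10 "http://localhost:9200/_cluster/health" > /dev/null; then
    systemctl restart elasticsearch
fi

# run it every 2 minutes from cron, e.g. in /etc/cron.d/es_checker:
# */2 * * * * root /usr/local/bin/es_checker.sh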

  • Note:
  1. Elasticsearch runs on the JVM; if the load increases, the JVM will consume the full heap and you can run into out-of-memory issues.
  2. You can't keep increasing the infrastructure for the ELK VMs.
  3. In my experience, 8 vCPUs and 16GB RAM are more than sufficient for a single-node ELK setup.
  4. I have Elasticsearch, Kibana, Logstash, td-agent, Filebeat, and Metricbeat enabled in each VM.

If you want any help with ELK, kindly drop an email to ezhilrean@gmail.com

I solved the issue - it was a bad install. We installed the older ES version, 7.9.2, directly from the ES repo, which now matches my dev environment.

Apparently the sysadmin had installed the previous ES from some random repo (CentOS/RHEL Packages Repository - GetPageSpeed) and not directly from the ES repo.

So that might have been one of the issues.
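For anyone hitting the same thing, a rough sketch of how to check where the package came from and reinstall from the official repo on CentOS/RHEL (the repo definition below is roughly the standard Elastic 7.x one):

# see which repo the installed package came from
yum info elasticsearch | grep -iE 'from repo|vendor'

# add the official Elastic 7.x repo and install the pinned version
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
sudo tee /etc/yum.repos.d/elasticsearch.repo > /dev/null <<'EOF'
[elasticsearch]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
EOF
sudo yum remove elasticsearch
sudo yum install elasticsearch-7.9.2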

We also removed Java and reinstalled it.

Once that was done, ES started singing. No more crazy GC every 5 seconds - all the issues went away.

The only things I changed on the new install were the heap size (31GB) and the node/cluster names. Everything else was left as-is.
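For completeness, a sketch of where those two changes live (placeholder values, assuming the stock package layout under /etc/elasticsearch):

# heap size, in jvm.options or in a file under jvm.options.d/ (any name ending in .options)
-Xms31g
-Xmx31g

# elasticsearch.yml
cluster.name: example-cluster
node.name: web1.example.com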

Thanks for the help @warkolm!

