ElasticSearch crashing - Magento 2.4.3

We're having issues with Elasticsearch crashing from time to time. It also sometimes spikes RAM and CPU usage, and the server becomes unresponsive.

We have left most of the settings as-is, but had to increase the JVM heap (to 48GB) to stop it from crashing so frequently.

I started digging, and apparently ~32GB is the maximum heap you should be using. We'll tweak that.
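
Something like this is the change we plan to make in jvm.options (a sketch; 31g should keep us below the compressed-oops cutoff):

    # keep -Xms and -Xmx identical; staying below ~32GB lets the JVM use compressed object pointers
    -Xms31g
    -Xmx31g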

The server is:

CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x NVMe in RAID 1
^^^ There is more than enough hardware to handle something like this, but something tells me more configuration is needed to handle this much data.

We're running a Magento 2.4.3 CE store with about 400,000 products.

Here are all of our config files:

**jvm.options file**


    ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2


**elasticsearch.yml**

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes: 
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true

Some outputs from querying Elasticsearch:

curl -XGET --user username:password http://localhost:9200/

{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

curl --user username:password -sS http://localhost:9200/_cluster/health?pretty

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}

curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty

{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}

^^^ The issue is that I don't know how to set up multiple nodes on one machine.

From what I understand, the misconfiguration is that we're running only one node. From my reading, 3 master nodes are required for green status.

How do I set up multiple nodes on a single machine, and do I need to increase the number of data nodes?

My main suspicions:

  • not enough master / data nodes
  • the newer garbage collector is having issues (I wasn't sure how to determine which collector is currently enabled from the config) -- ALREADY FIGURED IT OUT: G1GC is used (see the curl sketch after this list)
  • no recovery setup in case of crash (gateway.expected_nodes, gateway.recover_after_time)
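
This is roughly how I checked which collector is actually active (a sketch; adjust credentials as needed):

    # the jvm section of the node info API lists gc_collectors and the JVM input arguments
    curl --user username:password -sS "http://localhost:9200/_nodes/jvm?pretty"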

UPDATE:

Here is the error log from elasticsearch.log

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Welcome to our community! :smiley:

I've been commenting on your SO post as well, so the advice is the same.

A few things:

  • High CPU or memory use will not be due to not setting those gateway settings, and on a single-node cluster they are somewhat irrelevant
  • We recommend keeping the heap <32GB, see Advanced configuration | Elasticsearch Guide [7.14] | Elastic
  • You can never allocate replica shards on the same node as the primary, so for a single-node cluster you either need to remove replicas (risky) or add (ideally) another 2 nodes to the cluster
  • Setting up a multi-node cluster on the same host is a little pointless; sure, your replicas will be allocated, but if you lose the host you lose all of the data anyway

I'd definitely suggest looking at Bootstrap Checks | Elasticsearch Guide [7.14] | Elastic and applying the settings it talks about, because even if you are running a single node those are what we refer to as production-ready settings, and there are performance-related settings in there (e.g. memory_lock).
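
For example, enabling memory locking is usually a two-part change, roughly like this (a sketch assuming an RPM install managed by systemd; adjust paths to your setup):

    # elasticsearch.yml
    bootstrap.memory_lock: true

    # systemd drop-in, e.g. /etc/systemd/system/elasticsearch.service.d/override.conf
    [Service]
    LimitMEMLOCK=infinity

Then reload systemd and restart the service.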

Your logs show quite a lot of GC, so I would start with those bootstrap checks.
Also what is the output from the _cluster/stats?pretty&human API?
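
Something along these lines should fetch it (adjust the credentials):

    curl --user username:password -sS "http://localhost:9200/_cluster/stats?pretty&human"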

(And please use gist/pastebin/etc for logs etc, having to download them is kinda clunky :slight_smile: )

Thank you for your help!

Here is the output of the _cluster/stats?pretty&human

Sorry, I'm new to handling Elasticsearch. But being the only Magento dev in the whole company, I have to start handling it - the learning curve is a bit steep.

That's ok! We'll do what we can to help.

Thanks for that. It doesn't show much, other than perhaps a bit of custom mapping that might be worth looking at later.

Is it possible for you to go over those bootstrap checks and do things like memory_lock? At best it should help, and at worst it brings things up to production-level configuration.

Elasticsearch is up and running right now, so the output might not show the exact issue.

I will go through the bootstrap checks tomorrow. For now my sysadmin is handling restarts until we figure them out. It's been up for the last 8 hours or so.

Thank you for the heads-up about the bootstrap settings; I will work on those tomorrow.

I have reduced the amount of data being pushed into Elasticsearch and stopped the FPC (full page cache) warmer - that might help with the overall load on Elasticsearch.

Each catalog and product page on the Magento site now sends queries to Elasticsearch, so when we went from Magento 2.3.3 (which still used MySQL for part of the queries) to Magento 2.4.3 (which uses only Elasticsearch), the load on it probably increased dramatically.
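
If it's useful, I believe the cumulative per-node search counters can show roughly how much query traffic the store generates (untested sketch):

    # indices.search.query_total and query_time_in_millis accumulate since node start
    curl --user username:password -sS "http://localhost:9200/_nodes/stats/indices/search?pretty"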

Are you using Monitor a cluster | Elasticsearch Guide [7.14] | Elastic at all?

It'll show you quite a lot of what's happening in Elasticsearch, it might help pinpoint some of the issues.
Otherwise look at Slow Log | Elasticsearch Guide [7.14] | Elastic (the slow logs are stored on the filesystem alongside the Elasticsearch log) and Nodes hot threads API | Elasticsearch Guide [7.14] | Elastic to see if anything pops up.
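
The search slow log is enabled per index with dynamic settings, something like this (the thresholds are only examples, and the index name is just the one from your earlier output):

    curl --user username:password -sS -X PUT \
      "http://localhost:9200/example-amasty_product_1_v156/_settings" \
      -H 'Content-Type: application/json' \
      -d '{
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.query.info": "1s"
      }'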


Can you please explain the part about the bootstrap checks?

Are these looked up in elasticsearch.yml, or is there a curl command to check them?

These ones - Bootstrap Checks | Elasticsearch Guide [7.14] | Elastic
They are config changes made in elasticsearch.yml for the most part, but there are some OS level settings that are recommended as well.


Not really. My sysadmin has set up Munin to view performance graphs of Elasticsearch, but it does not really give a lot of information.

[Munin performance graphs]

You can see it crashed about 6 AM today and was down for a while.

Here is the output of the hot threads query:

::: {web1.example.com}{2THEUTSaQdmOJAAhTTN71g}{5jDWSYFvSYuqcKzjmqlREA}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw}{ml.machine_memory=134622244864, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=51539607552}
   Hot threads at 2021-09-14T23:35:24.172Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   23.2% (116.2ms out of 500ms) cpu usage by thread 'elasticsearch[web1.example.com][search][T#40]'
     10/10 snapshots sharing following 15 elements
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:385)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:439)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:411)
       app//org.elasticsearch.search.SearchService$$Lambda$6306/0x0000000801a238c8.get(Unknown Source)
       app//org.elasticsearch.search.SearchService$$Lambda$6307/0x0000000801a23af0.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$6117/0x00000008019d66b8.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@16/java.lang.Thread.run(Thread.java:831)

I have a question about discovery.seed_hosts.

If we are on a single server, does this need to be set, and what can we set it to? Something like:

    discovery.seed_hosts:
      - localhost:9200

I'm also confused about the ports - some of the configuration settings mention 9200 and some 9300 - is there a difference?

I'm going through the bootstrap checks right now and making changes; hopefully we can fix some of these issues.

One more question:

On Stack Overflow, somebody mentioned that setting number_of_replicas to 0 would solve the yellow status issue.

The setting is done through a PUT:

PUT /my-index-000001/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}

Our index changes every day with a product import, so the my-index-000001 number changes to my-index-000002 and so on.

Can this setting be applied through the elasticsearch.yml configuration, without tying it to a specific index?

You can skip the discovery bootstrap check for the moment.

As for the replicas, yes you can set them via the APIs, but not via the config. You will want to look at the _template API to set that.


Thanks!

Here's what I've done so far:

I figured out how to limit the number of replicas.

This can be done via templates:

PUT _template/all
{
  "index_patterns": ["*"],
  "settings": {
    "number_of_replicas": 0
  }
}
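
To double-check that the template is in place (sketch):

    curl --user username:password -sS "http://localhost:9200/_template/all?pretty"

As far as I understand, the template only applies to indices created after it exists, which should be fine for us since the product indices are recreated on every import.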

I will be testing it tomorrow to see if it has an effect and turns the status green.

I don't think it will do anything performance-wise, but we'll see.

I'm working through the other suggestions (the OS-level settings are sketched after this list):

  • Limited the JVM heap to 31GB
  • File descriptor limit is already set to 65535
  • Maximum number of threads is already set to 4096
  • Maximum size virtual memory check is already increased and configured
  • Maximum map count bumped to 262144
  • G1GC is disabled (by default)
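
For reference, the OS-level side of those checks looks roughly like this on our CentOS 7 box (a sketch; the exact file paths are assumptions, not copied from the server):

    # /etc/security/limits.d/elasticsearch.conf
    elasticsearch  -  nofile  65535
    elasticsearch  -  nproc   4096

    # /etc/sysctl.d/99-elasticsearch.conf
    vm.max_map_count = 262144

(I gather that for a systemd-managed service the file descriptor and thread limits may instead need to go into the unit file as LimitNOFILE/LimitNPROC.)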

One thing I'm trying is to reduce the:

8-13:-XX:CMSInitiatingOccupancyFraction=75

to

8-13:-XX:CMSInitiatingOccupancyFraction=70

I believe this will make garbage collection kick in earlier and prevent out-of-memory errors. We'll try adjusting it up/down to see if it helps.

Switch to G1GC

I realize this is not really encouraged, but there are articles describing similar out-of-memory issues where switching to G1GC helped resolve them: Garbage Collection in Elasticsearch and the G1GC | by Prabin Meitei M | Naukri Engineering | Medium

This is going to be the last thing I'm going to try.

It might, but it's probably not worth altering.

As for switching to G1GC - you should really be using it already; it's the default with later versions of Elasticsearch and JVMs.


Are there any specific settings we should be using for ingesting a large amount of data for this garbage collector? (XX:G1HeapRegionSize, -XX:GCPauseTimeInterval, -XX:MaxGCPauseMillis)

[Munin graph showing delete and reindex spikes]

As you can see, we have BIG deletes and then a reindex after a product import comes through. I think this delete action is actually causing our issues.

The recommendation is not to change any JVM options other than the heap.

Deleting documents or indices?

I believe documents. I don't really know the internals of the application that is handling adding / removing data.

From what I understand, it versions each of the indices every time it does a reindex and then deletes the old ones:

So let's say it indexes the new products; it will create a new index:

yellow open example-amasty_product_1_v162 6Qmg2ziQTO20YSX-oFYEIw 1 1 88867 0 174.5mb 174.5mb

Then the next day it will create a new one:

yellow open example-amasty_product_1_v163 6Qmg2ziQTO20YSX-oFYEIw 1 1 88867 0 174.5mb 174.5mb

It then deletes the v162 to save space.

I think this is what the document delete action is doing. I'm not 100% sure, but this is what I assume is happening.
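
If I want to confirm that, I assume something like this would list the product indices with their deleted-document counts (sketch; the index pattern is taken from the earlier output):

    curl --user username:password -sS \
      "http://localhost:9200/_cat/indices/example-amasty_product*?v&h=index,docs.count,docs.deleted,store.size&s=index"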

Looks like the spikes have leveled off finally.

It's performing well, with no issues overnight. There are some GC spikes from time to time, but it's much better than before.

It's not as snappy as it was with the 48GB heap, but it still performs.

Thank you for all the help! I'll report back if anything changes.

Hmmm, we added a bit more data and the Elasticsearch service crashed again.

As soon as we reduced it back, it came back up. It's very strange - even looking at the index, 32GB of heap should be more than enough to cover it:

health status index uuid pri rep docs.count docs.deleted store.size

green open example-amasty_product_1_v196 8fzoc6m8SV24d3T3gyILZA 1 0 89658 0 197.8mb

The hard limits are up to 20 million records - I doubt we have that many.
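
To check, I think the total document count can be pulled with something like this (sketch):

    curl --user username:password -sS "http://localhost:9200/_cat/count?v"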