Why so huge memory requirements when creating new replicas?

Hello,
does anyone know what exactly is going on on data nodes when number of replica setting is increased and new replicas are started to build?

When I change the replica number from 0 to 1 on some slightly larger index (10 shards, 15GB and 2M documents each) I see fast and unlimitted memory consumption on those data nodes which serve as the source of data for new replicas.

How can I find out what is this memory allocated for?

Is it possible to limit the amount of allocated memory somehow?

I would like to limit this memory consumption peaks to some acceptable level (and yes, I don't consider consuming 50G of heap in 4 minutes acceptable :slight_smile: ).

You can see the progress of memory consumption in the picture bellow. Yellow mark is the time when the number of replicas was increased.

It looks like it increased during allocation and then dropped off, is that fair to say?

Yes it is, but I want to know the details because this behaviour in combination with Parent Circuit Breaker causes cluster instability.

That's why I want to know, what exactly is allocated and if there is any way how to set some limit to it.

More heap doesn't help, it seems that Elasticsearch tries to allocate all available memory and it is not good for smooth node running.

What version are you asking about? This area has been changed a lot in the past few versions so it's hard to give a detailed answer without knowing that.

Also what are the values of cluster.routing.allocation.node_concurrent_recoveries, cluster.routing.allocation.node_concurrent_incoming_recoveries and cluster.routing.allocation.node_concurrent_outgoing_recoveries in your cluster?

I'm sorry, my fault.

Our version is 7.8.0

Routing section of our cluster settings follows (everything is in default)

"routing" : {
        "use_adaptive_replica_selection" : "true",
        "rebalance" : {
          "enable" : "all"
        },
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "2",
          "node_initial_primaries_recoveries" : "4",
          "same_shard" : {
            "host" : "false"
          },
          "total_shards_per_node" : "-1",
          "shard_state" : {
            "reroute" : {
              "priority" : "NORMAL"
            }
          },
          "type" : "balanced",
          "disk" : {
            "threshold_enabled" : "true",
            "watermark" : {
              "flood_stage" : "95%",
              "high" : "90%",
              "low" : "85%",
              "enable_for_single_data_node" : "false"
            },
            "include_relocations" : "true",
            "reroute_interval" : "60s"
          },
          "awareness" : {
            "attributes" : [ ]
          },
          "balance" : {
            "index" : "0.55",
            "threshold" : "1.0",
            "shard" : "0.45"
          },
          "node_concurrent_outgoing_recoveries" : "2",
          "allow_rebalance" : "indices_all_active",
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2"
        }
      }

Ok, these settings look reasonable. Any chance you can try and reproduce this in 7.9.1 since there have been possibly-impactful changes in this area since 7.8.0?

Also can you share the JVM options with which this node was started? GET _nodes/jvm is the simplest way to get hold of this.

Your graph shows a heap size of over 90GB. Is that a cluster-wide total or is it just one node? If just one node, that's much larger than we recommend and test:

Set Xmx and Xms to no more than the threshold that the JVM uses for compressed object pointers

To be honest I'm not eager to upgrade to 7.9.1 cause I've seen info about the Lucene memory leak that affected 7.9.0 version.

The size of JVM heap is 96GB per one data node. And yes, I know that in documentation you recommend to not go higher than 32GB per node. But the main reasoning you use is to avoid JVM overhead of using non-compressed object pointers. I haven't realized any official info, that there are some BUGs in ES that causes problems with larger heaps.

JVM itself works fine and G1 GC has no problem with heaps around 100GB.

I am just trying to understand what exactly is Elasticsearch doing when creating new replicas and why it allocates unlimitted amount of memory.

JVM options follow

{
        "pid" : 25732,
        "version" : "14.0.1",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "14.0.1+7",
        "vm_vendor" : "AdoptOpenJDK",
        "bundled_jdk" : true,
        "using_bundled_jdk" : true,
        "start_time_in_millis" : 1600069159436,
        "mem" : {
          "heap_init_in_bytes" : 103079215104,
          "heap_max_in_bytes" : 103079215104,
          "non_heap_init_in_bytes" : 7667712,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 0
        },
        "gc_collectors" : [
          "G1 Young Generation",
          "G1 Old Generation"
        ],
        "memory_pools" : [
          "CodeHeap 'non-nmethods'",
          "Metaspace",
          "CodeHeap 'profiled nmethods'",
          "G1 Eden Space",
          "G1 Old Gen",
          "G1 Survivor Space",
          "CodeHeap 'non-profiled nmethods'"
        ],
        "using_compressed_ordinary_object_pointers" : "false",
        "input_arguments" : [
          "-Xshare:auto",
          "-Des.networkaddress.cache.ttl=60",
          "-Des.networkaddress.cache.negative.ttl=10",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-XX:-OmitStackTraceInFastThrow",
          "-XX:+ShowCodeDetailsInExceptionMessages",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dio.netty.allocator.numDirectArenas=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Djava.locale.providers=SPI,COMPAT",
          "-Xms96g",
          "-Xmx96g",
          "-XX:+UseG1GC",
          "-XX:MaxGCPauseMillis=300",
          "-XX:+ParallelRefProcEnabled",
          "-XX:G1ReservePercent=50",
          "-XX:InitiatingHeapOccupancyPercent=25",
          "-Djava.io.tmpdir=/tmp/elasticsearch-403851464928710223",
          "-XX:+HeapDumpOnOutOfMemoryError",
          "-XX:HeapDumpPath=/var/lib/elasticsearch",
          "-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
          "-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m",
          "-XX:MaxDirectMemorySize=51539607552",
          "-Des.path.home=/usr/share/elasticsearch",
          "-Des.path.conf=/etc/elasticsearch",
          "-Des.distribution.flavor=default",
          "-Des.distribution.type=deb",
          "-Des.bundled_jdk=true"
        ]
      }

The memory leak introduced in 7.9.0 was addressed in 7.9.1, no need to worry about that.

I'm not saying that there are any bugs related to larger heaps, just that we do not test this configuration so you are somewhat on your own here. This might well be an issue in Elasticsearch that affects all heap sizes but which becomes more impactful with oversized heaps since larger heaps take correspondingly more effort to GC.

JVM options also look ok to me. Could you share the GC logs from the time of the problem?

The only truly reliable way to answer your question is to take a heap dump while the heap usage is high and analyse it to see what is taking up space that shouldn't be there. I'm hoping there are some clues elsewhere since heap dump analysis is pretty heavyweight, but unfortunately this doesn't seem to be any of the usual candidates so guesswork isn't helping.

1 Like

I appreciate your attitude and I'll gladly try to be as helpfull as possible.

For now, I found the appropriate part of GC log. There is just +2h offset on the graph because of our timezone. So 14:00 on the graph above is 12:00 in this log file.

I guess all the fun starts around:

[2020-09-14T12:06:53.426+0000][20687][gc,start    ] GC(842) Pause Young (Concurrent Start) (G1 Humongous Allocation)

If you want me to induce the situation again and create a heap dump, please tell me, what command I should exactly execute on the node to get something that will be useful.

Thanks. Hmm, there are a very large number of (transient) humongous allocations at around the time of the spike in heap usage:

$ cat gc.node08_partial.log | grep -e 'Humongous regions' | less
...
[2020-09-14T12:04:23.056+0000][20687][gc,heap     ] GC(836) Humongous regions: 81->70
[2020-09-14T12:04:46.512+0000][20687][gc,heap     ] GC(837) Humongous regions: 76->70
[2020-09-14T12:05:19.159+0000][20687][gc,heap     ] GC(838) Humongous regions: 81->71
[2020-09-14T12:05:53.293+0000][20687][gc,heap     ] GC(839) Humongous regions: 77->69
[2020-09-14T12:06:18.852+0000][20687][gc,heap     ] GC(840) Humongous regions: 74->71
[2020-09-14T12:06:44.967+0000][20687][gc,heap     ] GC(841) Humongous regions: 185->72
[2020-09-14T12:06:53.487+0000][20687][gc,heap     ] GC(842) Humongous regions: 309->71
[2020-09-14T12:07:56.721+0000][20687][gc,heap     ] GC(844) Humongous regions: 1766->63
[2020-09-14T12:08:18.573+0000][20687][gc,heap     ] GC(845) Humongous regions: 668->64
[2020-09-14T12:08:18.617+0000][20687][gc,heap     ] GC(846) Humongous regions: 64->63
[2020-09-14T12:09:03.738+0000][20687][gc,heap     ] GC(848) Humongous regions: 1326->64
[2020-09-14T12:09:32.791+0000][20687][gc,heap     ] GC(849) Humongous regions: 844->65
[2020-09-14T12:09:50.836+0000][20687][gc,heap     ] GC(850) Humongous regions: 548->65
[2020-09-14T12:10:03.531+0000][20687][gc,heap     ] GC(851) Humongous regions: 401->63
[2020-09-14T12:10:04.278+0000][20687][gc,heap     ] GC(853) Humongous regions: 82->62
[2020-09-14T12:10:04.989+0000][20687][gc,heap     ] GC(855) Humongous regions: 82->62
[2020-09-14T12:10:05.689+0000][20687][gc,heap     ] GC(857) Humongous regions: 82->63
[2020-09-14T12:10:06.415+0000][20687][gc,heap     ] GC(859) Humongous regions: 83->62
[2020-09-14T12:10:07.153+0000][20687][gc,heap     ] GC(861) Humongous regions: 82->62
[2020-09-14T12:10:07.877+0000][20687][gc,heap     ] GC(863) Humongous regions: 82->63
...

Bear with me, I'll ask around about this.

This is how G1 works more or less by design. The humongous allocations do play a part in it and it could be interesting to take a quick look at a heap dump during such a heap rise. Would you be able/permitted to take a heap dump and send to me (I can provide an upload link in a PM)?

G1 does not really collect concurrently, it only does some of the phases concurrently (marking and more). When a concurrent cycle ran, it typically takes another young GC to actually collect - which might be many GBs of allocation on your large heap.

However, with respect to the real memory circuit breaker we worked around this in #58674 and the solution to that part of the issue is to upgrade to 7.9.1+.

1 Like

Yes, I can take a heap dump but I would prefere to have proper instructions how to do that. I could use Google but I want to be sure that the result will be useful for your purposes.

OS[Linux/4.15.0-112-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/14.0.1/14.0.1+7]

I am glad to hear that the problem decribed in #58674 is solved in 7.9.1+ because that was the root cause why I started to analyze the node behaviour in detail.

With real memory circuit breaker enabled it was pure lotery, when I added more replicas to my indexes, if our cluster stayed alive or would go down :frowning:

So I disabled this option (and it solved the GC problem) but I am still curious why there is such a big amount of memory allocation while adding new replicas and thats why I started this topic.

Ugh :slight_smile:

I tried to induce the same situation just by adding more replicas to existing indexes but the problem didn't occur. Now I am trying to figure out what is the difference because in the past I was facing the problem on regular basis.

One theory I have is, that the problem occurs only when I add replica to newly created and filled indexes (using _reindex). Cause what I was doing last few weeks was exactly this:

  1. Create new index with replica count 0 and refresh interval to 60s
  2. Run _reindex command to migrate data from an old cluster
  3. Change replica count to 1 and refresh interval to 1s

And each time at point 3. I was facing the problem with memory. The main difference from what I see in Kibana management now is, that there was definitely something in Translog column (now, there is just n/a). But it would make sense if the situation had to be so specific, because if it would not, more users would have faced the same problem.

So bear with me, cause it will take some time to create appropriate testing data.
I will let you know when I have some relevant memory dumps.

I am giving up, sorry :frowning:

I tried four times similar steps which led in the past to the problematic memory consumption peak. But something has to be different now and I am unable to find what it is. So I have no memory dump.

Problem is that the cluster is our new production cluster and I cannot start with data migration from the very beginning to try to prepare exactly same conditions.

There definitely is something but probably more specific than I have thought.

Thanks @zdeseb for trying to reproduce this. It does make sense that it is at least not a pure file-based recovery triggering this.