Note: this is not about the AWS Elasticsearch managed service; the cluster is self-managed.
I have an ES 6.8 cluster on m4.2xlarge (32 GB RAM) CentOS 7 machines on AWS.
GET _cat/nodes?v&s=name
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.1x.x.x31 64 99 10 1.32 1.25 1.24 di - data-1
10.1x.x.x6 30 99 6 0.72 0.80 0.84 di - data-2
10.1x.x.x34 68 99 36 1.12 1.08 1.18 di - data-3
10.1x.x.x03 49 99 17 1.36 1.40 1.40 di - data-4
10.1x.x.x33 44 99 49 1.54 1.68 1.67 di - data-5
10.1x.x.x10 44 99 13 1.26 1.45 1.57 di - data-6
10.1x.x.x8 32 99 8 1.39 1.17 1.17 di - data-7
10.1x.x.x7 43 71 2 0.42 0.31 0.26 mi * master-3
GET _cat/allocation?v&s=node
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
347 58.5gb 65.5gb 958.4gb 1023.9gb 6 10.1x.x.x31 10.1x.x.x31 data-1
240 42.3gb 48.1gb 975.8gb 1023.9gb 4 10.1x.x.x6 10.1x.x.x6 data-2
304 55gb 61.6gb 962.2gb 1023.9gb 6 10.1x.x.x34 10.1x.x.x34 data-3
382 57gb 64.4gb 959.5gb 1023.9gb 6 10.1x.x.x03 10.1x.x.x03 data-4
391 60.1gb 67.1gb 956.8gb 1023.9gb 6 10.1x.x.x33 10.1x.x.x33 data-5
391 55.8gb 63.1gb 960.8gb 1023.9gb 6 10.1x.x.x10 10.1x.x.x10 data-6
287 49.3gb 59.4gb 964.5gb 1023.9gb 5 10.1x.x.x8 10.1x.x.x8 data-7
We do bulk inserts/updates on it every morning and run searches during the day.
It's pretty stable. I set it up 4 months ago using the official Ansible role, so only Elasticsearch and the Datadog agent are installed on top of the default CentOS 7 image.
node.ml: false
bootstrap.memory_lock: true
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 10000
indices.queries.cache.size: 5%
es_heap_size: 20g <-- !!! m4.2xlarge has 32GB RAM
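For reference, a quick way I can double-check the heap each node actually got (standard _cat/nodes columns, nothing custom):

# heap.max should show ~20g per data node with the config above
GET _cat/nodes?v&h=name,heap.max,heap.current,heap.percent,ram.max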
No issues with ES 6.8 at all: no timeouts and no downtime since the cluster started.
As part of migrating to ES7, I started a new cluster with the same Ansible role, same configs, same hardware; everything is the same.
The initial load is usually pretty heavy, but as expected it went through without issues.
Then came the second process, which merges some documents using aggregation queries, and it failed with an error. The culprit is the terms aggregation size, which is set to 2000000 (why it's set to that number is a different topic). Only a small portion of the buckets is actually returned, because of the bucket_selector pipeline aggregation.
Here is an example with a nested aggregation, which is a bit more complex than the non-nested aggregations we also run:
GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000 <-- too big
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
Before adjusting the client and switching to a composite aggregation, I wanted to let the test finish the next day, so I changed "search.max_buckets" to 2000000. Everything else is the same as on ES 6.8.
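The change itself is just the dynamic cluster setting, roughly like this (persistent vs. transient is a side detail):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 2000000
  }
}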
The next day I found out that, although the test ES7 cluster hadn't been used at all, the Elasticsearch service had died on 4 nodes. I checked RAM and it looked like this: the JVM had eaten up everything.
I restarted the service on all nodes with es_heap_size set to 50% of RAM (16g) and continued the process with the aggregations. Got this:
type: circuit_breaking_exception Reason: "[parent] Data too large, data for [<http_request>] would be [16518949320/15.3gb], which is larger than the limit of [16254631936/15.1gb], real usage: [16518948880/15.3gb], new bytes reserved: [440/440b], usages [request=16440/16kb, fielddata=317/317b, in_flight_requests=154603580/147.4mb, accounting=540721543/515.6mb]"
OK, I found out that the difference between ES 7.4 and 6.8 is that the parent circuit breaker now takes real memory usage into account. I set indices.breaker.total.use_real_memory: false, since ES 6.8 has no such setting.
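As far as I understand it's a static setting, so it goes into elasticsearch.yml on every node followed by a restart; the stats call below is just the stock endpoint I'd use to sanity-check the breaker limits afterwards:

# elasticsearch.yml (static setting, needs a node restart)
indices.breaker.total.use_real_memory: false

# check breaker limits and estimated usage per node afterwards
GET _nodes/stats/breaker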
Ran the process again and there were no more issues.
So the only remaining difference in cluster config is the heap size: ES 6.8 has es_heap_size = 20g (~63% of RAM) vs ES 7.4 with es_heap_size = 16g (50% of RAM).
For the next 10 hours nobody touched the cluster, yet I see this pattern:
Memory graph for ES7 over the past 10 hours, NOT being used:
Memory graph for ES6 over the whole week, being used:
I understand the issue with the aggregation and, as I mentioned, will switch to a composite aggregation to get the buckets; that's a different topic.
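For what it's worth, the composite version is only a sketch so far; the paging part would look roughly like this, with the field name purely hypothetical and the nested handling plus the "more than one" duplicate filter deliberately left out (as far as I can tell bucket_selector can't be used under composite, so that filter would move to the client):

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "sameIds": {
      "composite": {
        "size": 10000,
        "sources": [
          { "id": { "terms": { "field": "someIdField" } } }
        ]
      }
    }
  }
}

Each response returns an after_key, which goes into the "after" parameter of the next request to pull the following page of buckets.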
My question is: why is the RAM consumption pattern on ES7 so different from ES6.8 (same configs/env/hardware)? I see these zig-zags on ES7, but the baseline keeps growing: after every cleanup the consumed memory is higher than it was after the previous cleanup. Should I expect it to eat up all the RAM again and die? Is there a way to track down what exactly is causing it to grow while the cluster is idle? (The snapshot calls I can run are listed below.)
Is there anything else I could adjust before diving into the aggregation rework? Maybe GC settings, or...?
Also, why did switching off real memory usage for the parent breaker do the trick?
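For reference, these are the kinds of per-node snapshots I can pull with the stock APIs if that helps with diagnosing the growth (filter_path is just to trim the output):

# JVM heap/pool numbers per node
GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem

# circuit breaker limits and estimated usage per node
GET _nodes/stats/breaker

# segment-related memory (terms, norms, points, doc values) per node
GET _nodes/stats/indices/segments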