G1GC causes CircuitBreakingException: [parent] Data too large on 7.1.1

Elasticsearch version (bin/elasticsearch --version):
7.1.1

Plugins installed:
none

JVM version (java -version):
openjdk version "11.0.3" 2019-04-16 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.3+7-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.3+7-LTS, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Linux ************ 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

When the JVM GC is set to G1, the shards become unstable.

Steps to reproduce:
Create an empty Elasticsearch cluster using the following config:

jvm.options

-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=75
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10
-XX:+AlwaysPreTouch
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-XX:-OmitStackTraceInFastThrow
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
-Djava.locale.providers=COMPAT

elasticsearch.yml

cluster.name: **** 
node.name: ${HOSTNAME} 
path.data: "/var/lib/elasticsearch" 
path.logs: "/var/log/elasticsearch" 
xpack.monitoring.collection.enabled: true 
xpack.security.enabled: true 
xpack.security.transport.ssl.enabled: true 
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: "/etc/elasticsearch/elastic-certificates.p12"
xpack.security.transport.ssl.truststore.path: "/etc/elasticsearch/elastic-certificates.p12"
cluster.initial_master_nodes: ...
discovery.seed_hosts: ...

The X-Pack monitoring bulk writes some data and causes the error.

Provide logs (if relevant):

[2019-06-25T13:51:07,806][DEBUG][o.e.a.a.c.s.TransportClusterStatsAction] [*************] failed to execute on node [Xcbc02xwRbaPToodXWpjhw]
org.elasticsearch.transport.RemoteTransportException: [*************][100.73.138.216:9300][cluster:monitor/stats[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [1936852044/1.8gb], which is larger than the limit of [1878733619/1.7gb], real usage: [1936848752/1.8gb], new bytes reserved: [3292/3.2kb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1026) [elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) [elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753) [elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) [transport-netty4-client-7.1.1.jar:7.1.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at 
....

@mvg Could you check this?

Hey @Wing924, please avoid pinging people directly. Read this, and specifically the "Also be patient" part.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

I'm sorry, but he said to ping him here.


Hey @Wing924, can you perhaps give more information about your cluster?
(Maybe share the cluster stats API output, so that we have a better understanding of what is in this cluster.)

Also, what is the reasoning behind changing the JVM GC settings? These are expert settings, and in general tweaking them does more harm than good. Usually the issue is memory pressure, and adding more memory or more nodes solves the problem.

Hi @mvg
Thank you for looking into this issue.

what is the reasoning behind changing the JVM GC settings?

I read Garbage Collection in Elasticsearch and the G1GC, and it suggests G1GC is faster than CMS.

I also found these settings in jvm.options, so I just replaced the CMS settings with the G1 settings.

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

Here is the cluster stats

Note: all 6 nodes are running on 1 vCPU 4GB memory VMs

curl http://************:9200/_cluster/stats

{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "***********",
  "cluster_uuid": "IvNoCNlaSx2wM9BUAeYBNg",
  "timestamp": 1561602797644,
  "status": "yellow",
  "indices": {
    "count": 19,
    "shards": {
      "total": 35,
      "primaries": 19,
      "replication": 0.8421052631578947,
      "index": {
        "shards": {
          "min": 1,
          "max": 2,
          "avg": 1.8421052631578947
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1.0
        },
        "replication": {
          "min": 0.0,
          "max": 1.0,
          "avg": 0.8421052631578947
        }
      }
    },
    "docs": {
      "count": 1757890,
      "deleted": 153278
    },
    "store": {
      "size_in_bytes": 1112251559
    },
    "fielddata": {
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 231,
      "memory_in_bytes": 4087030,
      "terms_memory_in_bytes": 1873471,
      "stored_fields_memory_in_bytes": 305904,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 50240,
      "points_memory_in_bytes": 303211,
      "doc_values_memory_in_bytes": 1554204,
      "index_writer_memory_in_bytes": 0,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 997184,
      "max_unsafe_auto_id_timestamp": 1561602684422,
      "file_sizes": {
        
      }
    }
  },
  "nodes": {
    "count": {
      "total": 6,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 6
    },
    "versions": [
      "7.1.1"
    ],
    "os": {
      "available_processors": 12,
      "allocated_processors": 12,
      "names": [
        {
          "name": "Linux",
          "count": 6
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "CentOS Linux 7 (Core)",
          "count": 6
        }
      ],
      "mem": {
        "total_in_bytes": 23725092864,
        "free_in_bytes": 1660436480,
        "used_in_bytes": 22064656384,
        "free_percent": 7,
        "used_percent": 93
      }
    },
    "process": {
      "cpu": {
        "percent": 95
      },
      "open_file_descriptors": {
        "min": 312,
        "max": 498,
        "avg": 392
      }
    },
    "jvm": {
      "max_uptime_in_millis": 133466,
      "versions": [
        {
          "version": "11.0.3",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "11.0.3+7-LTS",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": false,
          "count": 6
        }
      ],
      "mem": {
        "heap_used_in_bytes": 4229741568,
        "heap_max_in_bytes": 11865686016
      },
      "threads": 251
    },
    "fs": {
      "total_in_bytes": 187414388736,
      "free_in_bytes": 164492333056,
      "available_in_bytes": 156136689664
    },
    "plugins": [
      
    ],
    "network_types": {
      "transport_types": {
        "security4": 6
      },
      "http_types": {
        "security4": 6
      }
    },
    "discovery_types": {
      "zen": 6
    }
  }
}

Thanks for sharing this information.

In the GitHub issue you mentioned that this was an empty cluster, right?
However, there appears to be data in your cluster. Judging from the number of documents (1.7M+), this doesn't seem to be just monitoring documents.

The circuit breaker error occurred on the transport layer, which suggests that either too many requests or a few large ones are being sent to this cluster. Are you by any chance sending large bulk requests?

Also, I think it is unlikely that this error occurred because G1GC is used. Circuit breaker errors are thrown by Elasticsearch (to avoid OOMs and allocating more memory than is available) and not by the JVM, as is explained here: https://www.elastic.co/guide/en/elasticsearch/reference/7.1/circuit-breaker.html
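For what it's worth, the numbers in that error line up with the 7.x defaults (assuming indices.breaker.total.use_real_memory: true and the default parent limit of 95% of the heap): the reported limit is exactly 95% of a single node's heap from the cluster stats you shared above.

# per-node heap max, from heap_max_in_bytes in the cluster stats (6 nodes)
echo $((11865686016 / 6))             # 1977614336 bytes (~1.8gb)
# assumed 7.x default parent breaker limit: 95% of the heap (real-memory breaker)
echo $((11865686016 / 6 * 95 / 100))  # 1878733619 -- the exact limit in the error above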

Can you perhaps share your circuit breaker stats? These can be retrieved from the node stats API:
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/cluster-nodes-stats.html
(under the breaker section)
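For reference, a minimal way to pull just the breaker section; the host and the elastic user below are placeholders to adapt, and -u is needed because xpack.security is enabled on your cluster:

curl -s -u elastic 'http://localhost:9200/_nodes/stats/breaker?pretty'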

I cleared all the data by removing /var/lib/elasticsearch and set up the cluster again.
The problem happened again in less than an hour.
The cluster only has monitoring docs.

This is the cluster stats output:

{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "*******",
  "cluster_uuid": "KCku0RuiR8y2vVo27i2qpw",
  "timestamp": 1561645770461,
  "status": "yellow",
  "indices": {
    "count": 1,
    "shards": {
      "total": 1,
      "primaries": 1,
      "replication": 0.0,
      "index": {
        "shards": {
          "min": 1,
          "max": 1,
          "avg": 1.0
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1.0
        },
        "replication": {
          "min": 0.0,
          "max": 0.0,
          "avg": 0.0
        }
      }
    },
    "docs": {
      "count": 3836,
      "deleted": 752
    },
    "store": {
      "size_in_bytes": 2570332
    },
    "fielddata": {
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 10,
      "memory_in_bytes": 180618,
      "terms_memory_in_bytes": 47618,
      "stored_fields_memory_in_bytes": 3384,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 0,
      "points_memory_in_bytes": 3616,
      "doc_values_memory_in_bytes": 126000,
      "index_writer_memory_in_bytes": 0,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 1040,
      "max_unsafe_auto_id_timestamp": 1561641638909,
      "file_sizes": {
        
      }
    }
  },
  "nodes": {
    "count": {
      "total": 6,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 6
    },
    "versions": [
      "7.1.1"
    ],
    "os": {
      "available_processors": 12,
      "allocated_processors": 12,
      "names": [
        {
          "name": "Linux",
          "count": 6
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "CentOS Linux 7 (Core)",
          "count": 6
        }
      ],
      "mem": {
        "total_in_bytes": 23725092864,
        "free_in_bytes": 2463821824,
        "used_in_bytes": 21261271040,
        "free_percent": 10,
        "used_percent": 90
      }
    },
    "process": {
      "cpu": {
        "percent": 17
      },
      "open_file_descriptors": {
        "min": 312,
        "max": 327,
        "avg": 316
      }
    },
    "jvm": {
      "max_uptime_in_millis": 4151721,
      "versions": [
        {
          "version": "11.0.3",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "11.0.3+7-LTS",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": false,
          "count": 6
        }
      ],
      "mem": {
        "heap_used_in_bytes": 4109397936,
        "heap_max_in_bytes": 11865686016
      },
      "threads": 209
    },
    "fs": {
      "total_in_bytes": 187414388736,
      "free_in_bytes": 166808989696,
      "available_in_bytes": 158453346304
    },
    "plugins": [
      
    ],
    "network_types": {
      "transport_types": {
        "security4": 6
      },
      "http_types": {
        "security4": 6
      }
    },
    "discovery_types": {
      "zen": 6
    }
  }
}

I also found these settings in jvm.options, so I just replaced the CMS settings with the G1 settings.

Did you use all the settings in this section, or did you just selectively copy -XX:+UseG1GC? If the latter, please try again with everything that's mentioned under ## G1GC Configuration.

The changes I made from the original are:

  • -Xms and -Xmx
  • uncommented G1GC Configuration

diff

--- /var/chef/backup/etc/elasticsearch/jvm.options.chef-20190624095904.766405	2019-05-23 23:15:00.000000000 +0900
+++ jvm.options	2019-06-28 09:09:00.467975394 +0900
@@ -19,10 +19,8 @@
 # Xms represents the initial size of total heap space
 # Xmx represents the maximum size of total heap space

--Xms1g
--Xmx1g
+#-Xms1g
+#-Xmx1g
+-Xms1885m
+-Xmx1885m

 ################################################################
 ## Expert settings
@@ -42,10 +40,10 @@
 ## G1GC Configuration
 # NOTE: G1GC is only supported on JDK version 10 or later.
 # To use G1GC uncomment the lines below.
-# 10-:-XX:-UseConcMarkSweepGC
-# 10-:-XX:-UseCMSInitiatingOccupancyOnly
-# 10-:-XX:+UseG1GC
-# 10-:-XX:InitiatingHeapOccupancyPercent=75
+10-:-XX:-UseConcMarkSweepGC
+10-:-XX:-UseCMSInitiatingOccupancyOnly
+10-:-XX:+UseG1GC
+10-:-XX:InitiatingHeapOccupancyPercent=75

 ## DNS cache policy
 # cache ttl in seconds for positive DNS lookups noting that this overrides the

raw jvm.options

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

#-Xms1g
#-Xmx1g
-Xms1885m
-Xmx1885m

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:InitiatingHeapOccupancyPercent=75

## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

This looks to be a different configuration than the one in the opening post? That one did not have UseCMSInitiatingOccupancyOnly or UseConcMarkSweepGC for example.

This looks to be a different configuration than the one in the opening post? That one did not have UseCMSInitiatingOccupancyOnly or UseConcMarkSweepGC for example.

I tested with both configs; there is no difference.
10-:-XX:-UseCMSInitiatingOccupancyOnly disables -XX:+UseCMSInitiatingOccupancyOnly,
and 10-:-XX:-UseConcMarkSweepGC disables -XX:+UseConcMarkSweepGC.

The only difference is 10-:-XX:InitiatingHeapOccupancyPercent=75.
It defaults to 45, but 45 and 75 make no difference either.

@mvg @ywelsch Have you been able to reproduce this issue? Do I need to add more info?

I've tried reproducing this using Rally, but couldn't.

Do you encounter the same problem if you use the bundled JDK?

Can you continuously take the node stats during a run and share those once you're close to hitting CircuitBreakingException?

Does this state persist for long or is it just a short transient state?

Can you take a heap dump once your node is in this state and share that?
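Regarding capturing the node stats continuously, a simple polling sketch would do; the hostname, credentials, and the 30-second interval below are just placeholders to adapt:

# capture node stats every 30 seconds into timestamped files
while true; do
  curl -s -u elastic 'http://localhost:9200/_nodes/stats?pretty' \
    > "nodes_stats_$(date +%Y%m%dT%H%M%S).json"
  sleep 30
done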

Do you encounter the same problem if you use the bundled JDK?

I changed to the bundled (default) JDK, but the problem still happens.

$ sudo systemctl status elasticsearch -l
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-07-12 17:44:10 JST; 20min ago
     Docs: http://www.elastic.co
 Main PID: 22087 (java)
   CGroup: /system.slice/elasticsearch.service
           ├─22087 /usr/share/elasticsearch/jdk/bin/java -Xms1885m -Xmx1885m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=75 -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-7694422566437407641 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Dio.netty.allocator.type=pooled -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
           └─22173 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

Jul 12 17:44:10 ********* systemd[1]: Started Elasticsearch.
Jul 12 17:44:11 ********* elasticsearch[22087]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Jul 12 17:44:11 ********* elasticsearch[22087]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Jul 12 17:44:12 ********* elasticsearch[22087]: [2019-07-12T17:44:12,878][WARN ][o.e.c.l.LogConfigurator  ] [*********] Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace `%node_name` with `[%node_name]%marker ` in these locations:
Jul 12 17:44:12 ********* elasticsearch[22087]: /etc/elasticsearch/log4j2.properties

Can you continuously take the node stats during a run and share those once you're close to hitting CircuitBreakingException?

I'll post them later.

Does this state persist for long or is it just a short transient state?

It persists for a long time.

Can you take a heap dump once your node is in this state and share that?

Could you tell me how to do that?

Can you continuously take the node stats during a run and share those once you're close to hitting CircuitBreakingException?

Please download it here:

https://drive.google.com/open?id=1ML98nZtsSHsH6q0YWbYC4wUTDlnF8ZKF

You've unfortunately provided the cluster stats, not the node stats.

I will need the node stats to have a look at the breaker usage.

Instructions on how to take a heap dump are here. Note that the node stats (requested above) might already give us a good idea of where memory is being consumed.
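As a rough sketch (assuming the bundled JDK's jmap and the PID file shown in your systemctl output; the dump path must be writable by the elasticsearch user):

# dump the live heap of the running Elasticsearch process
sudo -u elasticsearch /usr/share/elasticsearch/jdk/bin/jmap \
  -dump:live,format=b,file=/var/lib/elasticsearch/heap.hprof \
  $(cat /var/run/elasticsearch/elasticsearch.pid)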

@ywelsch
Sorry for providing the wrong info.
Here are the node stats: https://drive.google.com/file/d/1isFk8jeSaaipfSKd4QfCJ4gihagErOaS/view?usp=sharing

As for the heap dump, I'd like to send it to you by email because it may contain some sensitive info.

I've had a look at the node stats, which are not showing any significant memory usage by any of the breakers (quite puzzling). For the heap dump, you can send me a download link either via direct message here on discuss or to my e-mail yannick AT elastic co

which are not showing any significant memory usage by any of the breakers (quite puzzling)

Yes, since I only put monitoring data into the cluster, there should not be a memory shortage.
I think it's a false positive.

I have sent you the heap dump via email.