Hi Elasticians, I'm looking for help fixing an issue (or family of issues) that causes an Elasticsearch v7.8.1 cluster to repeatedly go to yellow state due to unassigned shard(s) that sometimes cannot "auto-heal". The issue usually comes with CircuitBreakingException [parent] entries in the logs, raised for a variety of reasons, and sometimes checking the cluster state through the REST APIs is not possible because of an illegal_argument_exception caused by "Values less than -1 bytes are not supported" (see #42725).
An instance of this issue happened recently and I collected what diagnostics I could from it. It resolved itself after a cluster restart, so I suspect that once the memory was cleared, Elasticsearch was able to assign the replicas and get back to green.
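For anyone wanting to look at the same kind of data, here is a minimal sketch of one way to pull per-node circuit breaker statistics from the `_nodes/stats/breaker` endpoint and list the breakers that look off. The base URL (plain HTTP on localhost:9200) and the helper names `fetch_json` and `suspicious_breakers` are assumptions for illustration only. Flagging negative estimates is a guess at what may relate to the "Values less than -1 bytes" error, not a confirmed cause.

```python
import json
import urllib.request

# Assumption: a node reachable over plain HTTP on the default port.
BASE_URL = "http://localhost:9200"


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """GET a REST path from the cluster and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def suspicious_breakers(node_stats):
    """Return (node_name, breaker, estimated_bytes, tripped) tuples for
    breakers that tripped at least once or report a negative estimate."""
    findings = []
    for node in node_stats.get("nodes", {}).values():
        for breaker, stats in node.get("breakers", {}).items():
            estimated = stats.get("estimated_size_in_bytes", 0)
            tripped = stats.get("tripped", 0)
            if tripped > 0 or estimated < 0:
                findings.append((node.get("name"), breaker, estimated, tripped))
    return findings
```

Against a live node, `suspicious_breakers(fetch_json("/_nodes/stats/breaker"))` would list the breakers worth a closer look. For the unassigned shards themselves, `GET _cluster/allocation/explain` is the usual starting point.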
Cluster Setup
It has 13 Ubuntu 18.04.2 LTS nodes on AWS EC2 t2.2xlarge servers (8 vCPUs, 32 GiB RAM, 100 GiB SSD reserved for data), split as follows:
- 3 master-eligible nodes, for proprietary GUI queries, aggregations and analytics
  - in the cluster.initial_master_nodes list
  - node.master: true
  - node.data and node.ingest: false
- 10 data nodes running proprietary data ingestion (node.js / librdkafka) with Windows logs, PaloAlto events, etc.
  - node.data: true
  - node.master and node.ingest: false
elasticsearch.yml
Settings common to all nodes (sanitized):
http.max_content_length: 500mb
http.max_initial_line_length: 32kb
cluster.name: ...
bootstrap.memory_lock: true
network.host: 0.0.0.0
indices.fielddata.cache.size: 30% 
indices.breaker.fielddata.limit: 40% 
gateway.recover_after_time: 5m
discovery.zen.ping_timeout: 15s 
discovery.zen.fd.ping_interval: 10s 
discovery.zen.fd.ping_timeout: 60s 
gateway.expected_nodes: 13
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 13
discovery.zen.ping.unicast.hosts: [...]
cluster.initial_master_nodes: [...]
transport.tcp.connect_timeout: 120s
network.publish_host: _eth0:ipv4_
node.name: "..."
Java Runtime Configuration
OpenJDK 64-Bit version "11.0.3" 2019-04-16
java
 -Xshare:auto
 -Des.networkaddress.cache.ttl=60
 -Des.networkaddress.cache.negative.ttl=10
 -XX:+AlwaysPreTouch
 -Xss1m
 -Djava.awt.headless=true
 -Dfile.encoding=UTF-8
 -Djna.nosys=true
 -XX:-OmitStackTraceInFastThrow
 -XX:+ShowCodeDetailsInExceptionMessages
 -Dio.netty.noUnsafe=true
 -Dio.netty.noKeySetOptimization=true
 -Dio.netty.recycler.maxCapacityPerThread=0
 -Dio.netty.allocator.numDirectArenas=0
 -Dlog4j.shutdownHookEnabled=false
 -Dlog4j2.disable.jmx=true
 -Djava.locale.providers=SPI,COMPAT
 -Xms16g
 -Xmx16g
 -XX:+UseG1GC
 -XX:G1ReservePercent=25
 -XX:InitiatingHeapOccupancyPercent=30
 -Djava.io.tmpdir=/tmp/elasticsearch-12345678...
 -XX:HeapDumpPath=data
 -XX:ErrorFile=logs/hs_err_pid%p.log
 -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
 -XX:MaxDirectMemorySize=8589934592
 -Des.path.home=/home/myuser/search
 -Des.path.conf=/home/myuser/search/config
 -Des.distribution.flavor=oss
 -Des.distribution.type=tar
 -Des.bundled_jdk=true
 -cp /home/myuser/search/lib/* org.elasticsearch.bootstrap.Elasticsearch -d
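To put the percentages from elasticsearch.yml next to the JVM flags, here is a small sketch of the arithmetic that turns them into byte budgets for the 16 GiB heap. The function names are made up for illustration. The parent breaker limit is not set in the configuration above, so the 95% used here is an assumption: it is the documented 7.x default when `indices.breaker.total.use_real_memory` is left at `true`.

```python
GIB = 1024 ** 3

HEAP_BYTES = 16 * GIB              # from -Xms16g / -Xmx16g
DIRECT_MEMORY_BYTES = 8589934592   # from -XX:MaxDirectMemorySize

# Percentages of the heap. The first two come from elasticsearch.yml;
# the last one is NOT in the posted config and is an assumed default.
LIMITS_PERCENT = {
    "indices.fielddata.cache.size": 30,
    "indices.breaker.fielddata.limit": 40,
    "indices.breaker.total.limit (assumed default)": 95,
}


def heap_budget(percent, heap_bytes=HEAP_BYTES):
    """Bytes corresponding to a percentage of the heap, rounded down."""
    return heap_bytes * percent // 100


def format_gib(num_bytes):
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / GIB:.2f} GiB"


for setting, percent in LIMITS_PERCENT.items():
    print(f"{setting}: {percent}% -> {format_gib(heap_budget(percent))}")
print(f"max direct memory: {format_gib(DIRECT_MEMORY_BYTES)}")
```

On these numbers the fielddata cache is capped at about 4.8 GiB and the fielddata breaker at about 6.4 GiB, while the heap itself takes half of the 32 GiB of RAM on each node.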
I will add more troubleshooting information soon...