Hi Elasticians, I'm looking for help fixing an issue (or family of issues) that causes an Elasticsearch v7.8.1 cluster to consistently go to yellow state due to unassigned shard(s) that sometimes cannot "auto-heal". This issue is usually accompanied by CircuitBreakingException [parent] errors in the logs, raised for a variety of reasons, and sometimes checking the cluster state through the REST APIs is not possible due to an illegal_argument_exception caused by "Values less than -1 bytes are not supported" (see #42725).
An instance of this issue happened recently and I collected what diagnostics I could from it. The cluster then recovered on its own after a cluster restart, so I suspect that once the memory was cleared, Elasticsearch was able to assign the replicas and get back to green.
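In case it helps anyone reproduce the diagnosis, here is a minimal sketch of how the unassigned shards can be listed and explained. It assumes the cluster answers on http://localhost:9200 without authentication (a placeholder, not my real endpoint), and `fetch_json`, `unassigned_shards` and `explain_unassigned` are hypothetical helper names. It only relies on the `_cat/shards` and `_cluster/allocation/explain` REST APIs.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: placeholder endpoint, no auth


def fetch_json(path, body=None):
    """GET a path on the cluster (POST when a body is given) and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def unassigned_shards(cat_shards_rows):
    """Keep only the rows of `_cat/shards?format=json` whose state is UNASSIGNED."""
    return [row for row in cat_shards_rows if row.get("state") == "UNASSIGNED"]


def explain_unassigned():
    """Ask the allocation explain API about every unassigned shard and print the answer."""
    rows = fetch_json(
        "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason"
    )
    for row in unassigned_shards(rows):
        body = {
            "index": row["index"],
            "shard": int(row["shard"]),
            "primary": row["prirep"] == "p",
        }
        explanation = fetch_json("/_cluster/allocation/explain", body)
        print(row["index"], row["shard"], row.get("unassigned.reason"))
        print(json.dumps(explanation, indent=2))
```

Calling `explain_unassigned()` against a reachable node prints one explanation per unassigned shard. When the parent circuit breaker is tripping, these requests may themselves be rejected, which matches what I see with the other REST APIs.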
Cluster Setup
The cluster has 13 Ubuntu 18.04.2 LTS nodes on AWS EC2 t2.2xlarge servers (8 vCPUs, 32 GiB RAM, 100 GiB SSD reserved for data), split as follows:
- 3 master-eligible nodes, for proprietary GUI queries, aggregations and analytics
  - in the cluster.initial_master_nodes list
  - node.master: true
  - node.data and node.ingest: false
- 10 data nodes running proprietary data ingestion (node.js / librdkafka) with Windows logs, PaloAlto events, etc.
  - node.data: true
  - node.master and node.ingest: false
elasticsearch.yml
Settings common to all nodes (sanitized):
http.max_content_length: 500mb
http.max_initial_line_length: 32kb
cluster.name: ...
bootstrap.memory_lock: true
network.host: 0.0.0.0
indices.fielddata.cache.size: 30%
indices.breaker.fielddata.limit: 40%
gateway.recover_after_time: 5m
discovery.zen.ping_timeout: 15s
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 60s
gateway.expected_nodes: 13
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 13
discovery.zen.ping.unicast.hosts: [...]
cluster.initial_master_nodes: [...]
transport.tcp.connect_timeout: 120s
network.publish_host: _eth0:ipv4_
node.name: "..."
Java Runtime Configuration
OpenJDK 64-Bit version "11.0.3" 2019-04-16
java
-Xshare:auto
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10
-XX:+AlwaysPreTouch
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-XX:-OmitStackTraceInFastThrow
-XX:+ShowCodeDetailsInExceptionMessages
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.locale.providers=SPI,COMPAT
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-Djava.io.tmpdir=/tmp/elasticsearch-12345678...
-XX:HeapDumpPath=data
-XX:ErrorFile=logs/hs_err_pid%p.log
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
-XX:MaxDirectMemorySize=8589934592
-Des.path.home=/home/myuser/search
-Des.path.conf=/home/myuser/search/config
-Des.distribution.flavor=oss
-Des.distribution.type=tar
-Des.bundled_jdk=true
-cp /home/myuser/search/lib/* org.elasticsearch.bootstrap.Elasticsearch -d
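To put the numbers above side by side, here is a small sketch of the memory budget per node implied by these settings. The heap size, direct memory size, RAM and fielddata percentages come from the configuration above; the 95% parent breaker limit is an assumption on my part (the 7.x default when indices.breaker.total.limit is unset and indices.breaker.total.use_real_memory is left at true), and `pct_of` is a hypothetical helper.

```python
GIB = 1024 ** 3

ram_bytes = 32 * GIB              # t2.2xlarge RAM
heap_bytes = 16 * GIB             # from -Xms16g / -Xmx16g
direct_memory_bytes = 8589934592  # from -XX:MaxDirectMemorySize

fielddata_cache_pct = 30    # indices.fielddata.cache.size
fielddata_breaker_pct = 40  # indices.breaker.fielddata.limit
parent_breaker_pct = 95     # assumption: 7.x default, not set in elasticsearch.yml


def pct_of(total_bytes, pct):
    """Integer number of bytes corresponding to pct percent of total_bytes."""
    return total_bytes * pct // 100


budget = {
    "heap": heap_bytes,
    "fielddata cache size": pct_of(heap_bytes, fielddata_cache_pct),
    "fielddata breaker limit": pct_of(heap_bytes, fielddata_breaker_pct),
    "parent breaker limit": pct_of(heap_bytes, parent_breaker_pct),
    "max direct memory": direct_memory_bytes,
    "RAM outside the heap": ram_bytes - heap_bytes,
}

for name, value in budget.items():
    print(f"{name}: {value / GIB:.1f} GiB")
```

Under that assumption, the [parent] breaker would trip once real heap usage approaches roughly 15.2 GiB of the 16 GiB heap, regardless of how much of that is fielddata.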
I will add more troubleshooting information soon...