Help with unassigned shards / CircuitBreakingException / Values less than -1 bytes are not supported

Hi Elasticians, I'm looking for help fixing an issue (or family of issues) that causes an Elasticsearch v7.8.1 cluster to consistently go yellow due to unassigned shard(s) that sometimes cannot "auto-heal". The issue is usually accompanied by CircuitBreakingException [parent] errors with a variety of causes in the logs, and sometimes checking the cluster state through the REST APIs is not even possible due to an illegal_argument_exception caused by "Values less than -1 bytes are not supported" (see #42725).

An instance of this issue happened recently and I collected what diagnostics I could from it; it then "auto-fixed" itself after a cluster restart, so I suspect that once the memory was cleared, Elasticsearch was able to assign the replicas and get back to green.
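
For context, these are the kinds of calls used to spot the unassigned replicas when the cluster goes yellow (a minimal sketch; localhost is a placeholder for one of the nodes):

# Overall health, including the unassigned shard count
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-shard view with the unassigned reason/timestamp columns
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason,unassigned.at'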

Cluster Setup

The cluster has 13 Ubuntu 18.04.2 LTS nodes on AWS EC2 t2.2xlarge instances (8 vCPUs, 32 GiB RAM, 100 GiB SSD reserved for data):

  • 3 master-eligible nodes, for proprietary GUI queries, aggregations and analytics
    • in cluster.initial_master_nodes list
    • node.master: true
    • node.data and node.ingest: false
  • 10 data nodes running proprietary data ingestion (Node.js / librdkafka) with Windows logs, Palo Alto events, etc.
    • node.data: true
    • node.master and node.ingest: false
elasticsearch.yml

Settings common to all nodes (sanitized):

http.max_content_length: 500mb
http.max_initial_line_length: 32kb
cluster.name: ...
bootstrap.memory_lock: true
network.host: 0.0.0.0
indices.fielddata.cache.size: 30% 
indices.breaker.fielddata.limit: 40% 
gateway.recover_after_time: 5m
discovery.zen.ping_timeout: 15s 
discovery.zen.fd.ping_interval: 10s 
discovery.zen.fd.ping_timeout: 60s 
gateway.expected_nodes: 13
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 13
discovery.zen.ping.unicast.hosts: [...]
cluster.initial_master_nodes: [...]
transport.tcp.connect_timeout: 120s
network.publish_host: _eth0:ipv4_
node.name: "..."
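
As a side note on the breaker-related overrides above: the effective limits can be checked against the built-in defaults. In 7.x the parent breaker defaults to 95% of the heap when real-memory tracking is enabled, which appears consistent with the ~15.1gb limit reported in the exceptions further down for a 16 GB heap. A minimal sketch (localhost is a placeholder):

# Effective circuit breaker settings alongside the built-in defaults (flat keys make them easy to grep)
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep -i breaker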
Java Runtime Configuration

OpenJDK 64-Bit version "11.0.3" 2019-04-16

java
 -Xshare:auto
 -Des.networkaddress.cache.ttl=60
 -Des.networkaddress.cache.negative.ttl=10
 -XX:+AlwaysPreTouch
 -Xss1m
 -Djava.awt.headless=true
 -Dfile.encoding=UTF-8
 -Djna.nosys=true
 -XX:-OmitStackTraceInFastThrow
 -XX:+ShowCodeDetailsInExceptionMessages
 -Dio.netty.noUnsafe=true
 -Dio.netty.noKeySetOptimization=true
 -Dio.netty.recycler.maxCapacityPerThread=0
 -Dio.netty.allocator.numDirectArenas=0
 -Dlog4j.shutdownHookEnabled=false
 -Dlog4j2.disable.jmx=true
 -Djava.locale.providers=SPI,COMPAT
 -Xms16g
 -Xmx16g
 -XX:+UseG1GC
 -XX:G1ReservePercent=25
 -XX:InitiatingHeapOccupancyPercent=30
 -Djava.io.tmpdir=/tmp/elasticsearch-12345678...
 -XX:HeapDumpPath=data
 -XX:ErrorFile=logs/hs_err_pid%p.log
 -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
 -XX:MaxDirectMemorySize=8589934592
 -Des.path.home=/home/myuser/search
 -Des.path.conf=/home/myuser/search/config
 -Des.distribution.flavor=oss
 -Des.distribution.type=tar
 -Des.bundled_jdk=true
 -cp /home/myuser/search/lib/* org.elasticsearch.bootstrap.Elasticsearch -d
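
One detail worth double-checking: the cluster stats output further down reports the bundled AdoptOpenJDK 14.0.1 as the running JVM rather than OpenJDK 11.0.3, so it may be worth confirming per node which JDK each Elasticsearch process is actually using (a sketch; localhost is a placeholder):

# Per-node JVM version, to confirm which JDK each process runs on
curl -s 'http://localhost:9200/_nodes/jvm?pretty&filter_path=nodes.*.name,nodes.*.jvm.version,nodes.*.jvm.vm_version'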

I will add more troubleshooting information soon...

Here are some shard allocation exceptions from the sanitized Support Diagnostics Utility output, showing CircuitBreakingException and ShardLockObtainFailedException:

cat indices.json | jq '.[] | select (.ur == "ALLOCATION_FAILED")'

{
  "index": "83f24f8-stream-11",
  "shard": "1",
  "prirep": "r",
  "state": "INITIALIZING",
  "docs": null,
  "store": null,
  "id": "5e34050cb972311283741b",
  "node": "448.473.434.468",
  "sm": null,
  "fm": null,
  "fe": null,
  "ft": null,
  "iiti": null,
  "iito": null,
  "idto": null,
  "idti": null,
  "ftt": null,
  "ua": "2020-12-02T01:31:09.548Z",
  "ud": "failed shard on node [31a23419452c38e69a31b0]: failed to perform 
indices:data/write/bulk[s]  on replica [83f24f8-stream-11][1], 
node[31a23419452c38e69a31b0], [R], s[STARTED], 
a[id=KSqkws2SR_ik3AfGzwrH9w], failure 
RemoteTransportException[[448.473.434.408][448.473.434.408:9300]
[indices:data/write/bulk[s][r]]]; 
nested: CircuitBreakingException[[parent] Data too large, data for 
[indices:data/write/bulk[s][r]] would be  [16425569298/15.2gb], 
which is larger than the limit of [16320875724/15.1gb], 
real usage: [16401644416/15.2gb], 
new bytes reserved: [23924882/22.8mb], 
usages [request=0/0b, fielddata=0/0b, in_flight_requests=48267630/46mb, 
accounting=3186816/3mb]]; ",
  "ur": "ALLOCATION_FAILED"
}

Other exceptions:

ShardLockObtainFailedException
{
  "index": "83f24f8-stream-12",
  "shard": "1",
  "prirep": "r",
  "state": "INITIALIZING",
  "docs": null,
  "store": null,
  "id": "0190d6de3b303366a07a95",
  "node": "448.473.434.524",
  "sm": null,
  "fm": null,
  "fe": null,
  "ft": null,
  "iiti": null,
  "iito": null,
  "idto": null,
  "idti": null,
  "ftt": null,
  "ua": "2020-12-02T01:22:17.237Z",
  "ud": "failed shard on node [5e34050cb972311283741b]: failed to create shard, 
failure IOException[failed to obtain in-memory shard lock]; nested: 
ShardLockObtainFailedException[[83f24f8-stream-12][1]:  obtaining shard lock 
timed out after 5000ms, previous lock details: [shard creation] trying to lock 
for [shard creation]]; ",
  "ur": "ALLOCATION_FAILED"
}
CircuitBreakingException on indices:data/write/bulk
{
  "index": "main-stream-7",
  "shard": "1",
  "prirep": "r",
  "state": "INITIALIZING",
  "docs": null,
  "store": null,
  "id": "bdb2ebd0fbf63a5482978a",
  "node": "448.473.434.470",
  "sm": null,
  "fm": null,
  "fe": null,
  "ft": null,
  "iiti": null,
  "iito": null,
  "idto": null,
  "idti": null,
  "ftt": null,
  "ua": "2020-12-02T01:20:49.253Z",
  "ud": "failed shard on node [0190d6de3b303366a07a95]: failed to perform 
indices:data/write/bulk[s] on replica [main-stream-7][1], 
node[0190d6de3b303366a07a95], [R], s[STARTED], 
a[id=Rd90vaSzRy6cHoKd2dynYw], failure 
RemoteTransportException[[448.473.434.524][448.473.434.524:9300]
[indices:data/write/bulk[s][r]]]; 
nested: CircuitBreakingException[[parent] Data too large, data for
 [indices:data/write/bulk[s][r]] would be [16490950262/15.3gb], 
which is larger than the limit of [16320875724/15.1gb], 
real usage: [16457950848/15.3gb], new bytes reserved: [32999414/31.4mb], 
usages [request=0/0b, fielddata=0/0b, in_flight_requests=32999414/31.4mb, 
    accounting=1221018/1.1mb]]; ",
  "ur": "ALLOCATION_FAILED"
}
CircuitBreakingException on indices:data/write/bulk
{
  "index": "main-stream-7",
  "shard": "0",
  "prirep": "r",
  "state": "INITIALIZING",
  "docs": null,
  "store": null,
  "id": "0190d6de3b303366a07a95",
  "node": "448.473.434.524",
  "sm": null,
  "fm": null,
  "fe": null,
  "ft": null,
  "iiti": null,
  "iito": null,
  "idto": null,
  "idti": null,
  "ftt": null,
  "ua": "2020-12-02T01:25:03.299Z",
  "ud": "failed shard on node [5e34050cb972311283741b]: failed to perform 
indices:data/write/bulk[s] on replica [main-stream-7][0],
 node[5e34050cb972311283741b], [R], s[STARTED], 
a[id=rSfY_8qpTdSF--LafnLo_w], failure 
RemoteTransportException[[448.473.434.468][448.473.434.468:9300]
[indices:data/write/bulk[s][r]]]; 
nested: CircuitBreakingException[[parent] Data too large, data for 
[indices:data/write/bulk[s][r]] would be [16462819846/15.3gb], 
which is larger than the limit of [16320875724/15.1gb], 
real usage: [16427059848/15.2gb], new bytes reserved: [35759998/34.1mb], 
usages [request=0/0b, fielddata=0/0b, in_flight_requests=131115656/125mb, 
accounting=2088240/1.9mb]]; ",
  "ur": "ALLOCATION_FAILED"
}
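
As a reference for anyone landing in the same state: the allocation explain API shows why a replica stays unassigned, and a reroute with retry_failed asks the master to retry shards that exhausted index.allocation.max_retries, which may avoid a full cluster restart once the memory pressure is gone. A minimal sketch (index/shard values taken from the failures above; localhost is a placeholder):

# Why is this replica unassigned / failing to allocate?
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -d '{ "index": "main-stream-7", "shard": 1, "primary": false }'

# Retry shards that exceeded their allocation retry limit
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'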

Hey @Yang_Wang, we discussed this issue during ElasticON; you asked me to ping you here so you could continue to assist me.

Are you still available? If not, could you refer it to a colleague? Thanks!

Support Diagnostic data

I have captured and sanitized the support diagnostic data, but I would rather share the zip file privately.

I see that you are using AWS t2 series instances. I have in the past seen issues with these, as they can get starved for CPU under load, which can cause slow cluster updates and instability. I would generally only consider using these for clusters under light load or for dedicated master nodes. It might be worth looking at CPU usage/credits over time for the various nodes to see if this could be a contributing factor, or trying a different instance type, e.g. m5 series instances, to see if that makes a difference.

I also see that you have some overrides for settings that are considered expert settings. This can also affect stability. How did you arrive at these settings?

Hi Christian, I'm using AWS t2.2xlarge instances (8 vCPUs, 32 GiB RAM, 100 GiB SSD for data) in a test environment to reproduce the issues that are happening in production on VMware ESXi servers (16 vCPUs 3GHz+, 64 GB RAM, 2 TB enterprise SSD).

I have very limited access to the production environment, which is running Elasticsearch v7.1.0, where I've seen the same symptoms. I was able to reproduce the issue on this AWS cluster on v7.1.0 and then tried upgrading to v7.8.1, observing it happening again. I can double-check CPU usage as you suggested, although I'm tempted to consider the hardware as the last option, since that should not be the issue in production.

Which specific settings are you referring to?
These settings were fine-tuned by other engineers, to whom I have no access except through source code comments and commit messages. I can tell they were very experienced, detail-oriented experts with broad and deep technical knowledge, but they have since moved on to different endeavors!

After trying the ES upgrade, I added G1ReservePercent=25 and changed InitiatingHeapOccupancyPercent from 75 to 30, following recommendations on the forums and the "new" defaults. Neither setting prevented the issue.

What is the full output of the cluster stats API?

Unfortunately, when I faced these issues, /_cluster/stats returned

Bad Request. Rejected
{
  "error" : { 
    "root_cause" : [ 
      {   
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b"
      }   
    ],  
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b",
    "suppressed" : [ 
      {   
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : { 
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }   
    ]   
  },  
  "status" : 400 
}

Please let me know if you're interested in the results of this API after restarting the cluster.

Can you post the full command you ran and the response?

It was the /_cluster/stats?pretty&human call from the Support Diagnostic data. Let me grab the zipped file.

Are you able to run the command directly against the cluster then?

How can I upload or send you the zip file from the diagnostic utility?

I think it'd be easier if you could run that command directly against the cluster and post the output here.

Here is the current output of /_cluster/stats?pretty. Please note that I turned off data collection, so the system is not under heavy load, and the previous errors were "auto-fixed" by the cluster restart.

{
  "_nodes" : {
    "total" : 13,
    "successful" : 13,
    "failed" : 0
  },
  "cluster_name" : "Sanitized",
  "cluster_uuid" : "Sanitized",
  "timestamp" : 1607472639242,
  "status" : "green",
  "indices" : {
    "count" : 7,
    "shards" : {
      "total" : 24,
      "primaries" : 12,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 4,
          "avg" : 3.4285714285714284
        },
        "primaries" : {
          "min" : 1,
          "max" : 2,
          "avg" : 1.7142857142857142
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 110055238,
      "deleted" : 19291759
    },
    "store" : {
      "size_in_bytes" : 149769977421
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 20194120,
      "total_count" : 3200,
      "hit_count" : 3043,
      "miss_count" : 157,
      "cache_size" : 49,
      "cache_count" : 49,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 349,
      "memory_in_bytes" : 5457876,
      "terms_memory_in_bytes" : 2449072,
      "stored_fields_memory_in_bytes" : 2439304,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 21056,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 548444,
      "index_writer_memory_in_bytes" : 449371472,
      "version_map_memory_in_bytes" : 515817362,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1606945773468,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "date",
          "count" : 25,
          "index_count" : 5
        },
        {
          "name" : "ip",
          "count" : 3,
          "index_count" : 3
        },
        {
          "name" : "keyword",
          "count" : 180,
          "index_count" : 7
        },
        {
          "name" : "long",
          "count" : 138,
          "index_count" : 7
        },
        {
          "name" : "object",
          "count" : 55,
          "index_count" : 6
        },
        {
          "name" : "text",
          "count" : 63,
          "index_count" : 7
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 13,
      "coordinating_only" : 0,
      "data" : 10,
      "ingest" : 0,
      "master" : 3,
      "remote_cluster_client" : 13
    },
    "versions" : [
      "7.8.1"
    ],
    "os" : {
      "available_processors" : 104,
      "allocated_processors" : 104,
      "names" : [
        {
          "name" : "Linux",
          "count" : 13
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.2 LTS",
          "count" : 13
        }
      ],
      "mem" : {
        "total_in_bytes" : 438491262976,
        "free_in_bytes" : 77731299328,
        "used_in_bytes" : 360759963648,
        "free_percent" : 18,
        "used_percent" : 82
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 88
      },
      "open_file_descriptors" : {
        "min" : 497,
        "max" : 562,
        "avg" : 523
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 515181326,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 13
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 85276715200,
        "heap_max_in_bytes" : 223338299392
      },
      "threads" : 939
    },
    "fs" : {
      "total_in_bytes" : 1366923763712,
      "free_in_bytes" : 1046325751808,
      "available_in_bytes" : 976317198336
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 13
      },
      "http_types" : {
        "netty4" : 13
      }
    },
    "discovery_types" : {
      "zen" : 13
    },
    "packaging_types" : [
      {
        "flavor" : "oss",
        "type" : "tar",
        "count" : 13
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }
}

Here are some details about the bulk API calls, from the logs of the JavaScript client on two of the servers that had the CircuitBreakingException for indices:data/write/bulk on the main-stream-7 index around the error time (from the /_cat/shards logs):

  1. Data server: 448.473.434.470
    a. unassigned.at: 2020-12-02T01:20:49.253Z*
    b. 624 bulk API calls ~17min range: 2020-12-02T01:10:17 to 01:27:50Z
    c. average request size: 2.5 MB, 40k lines, 1k records
    d. ~600k records sent to "main-stream-8"
    e. average 40 lines or 3KB per record
    f. sample bulk request at 01:19:06Z

    • 2.4k records, total 6MB, average 40 lines, 4KB
  2. Data server: 448.473.434.524
    a. unassigned.at: 2020-12-02T01:22:17.237Z
    b. 1086 bulk API calls ~30min range: 2020-12-02T01:00:16 to 01:30:21Z
    c. average request size: 1.5 MB, 50k lines, 1.5k records
    d. ~1.5m records sent to "83f24f8-stream-12"
    e. average 30 lines or 1KB per record
    f. sample bulk request at 01:20:07Z

    • 4k records, total 4MB, average 40 lines, 4KB

* Note: 2020-12-02T01:20:49.253Z == 2020-12-01T17:20:49.253 GMT-8:00
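
One more thing that can be correlated with these bulk windows is the per-node breaker usage; the node stats breaker section reports the same parent / in_flight_requests / fielddata figures quoted in the exceptions above. A minimal sketch (localhost is a placeholder):

# Circuit breaker usage, limits and trip counts per node
curl -s 'http://localhost:9200/_nodes/stats/breaker?human&pretty'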

The next time you see the "Values less than -1 bytes are not supported" error, please add the error_trace query parameter so we can see exactly where the unexpected negative number comes from. For instance: GET /_cluster/stats?pretty&human&error_trace.
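
For reference, the equivalent call via curl, assuming direct access to a node, would be something like:

curl -s 'http://localhost:9200/_cluster/stats?pretty&human&error_trace'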

Newer versions have better resilience against a CircuitBreakingException causing an unassigned shard. Upgrading will likely help too.

Cool, David, sure will do it!

Are there any other advanced debugging, troubleshooting, or stack-tracing options, or free tools, I can leverage, like the error_trace flag you mentioned?

Which version specifically would you recommend? I can give it a try too. v7.8.1 was chosen because it didn't involve a major Lucene bump (which may not be a concern) and the higher versions weren't "battlefield tested" yet at the time.

Do you have evidence that between v7.8.1 and v7.10.0 there were fixes in that memory-management area of the code? Thanks so much!

So the better resilience is from changes like https://github.com/elastic/elasticsearch/pull/55633, and the better back pressure is described in Indexing pressure | Elasticsearch Guide [7.10] | Elastic. IIRC both of these are in 7.9.

Just error_trace, thanks. There were a few bugs like this where we assumed sizes are always ≥0 but then a newer JVM started returning negative numbers. They should mostly be fixed now, but the stack trace of the exception will tell us for sure.

We don't do major Lucene bumps in between major Elasticsearch releases - the whole 7.x series is on Lucene 8.x.

Yes, I verified that #55633 has the 7.9 label and that the docs page on Indexing pressure is the same for 7.9 and 7.10. These are strong reasons to try 7.9 or 7.10. Thanks for mentioning these changes.
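
For reference, if I'm reading that docs page right, the 7.9+ back pressure is governed by the indexing_pressure.memory.limit node setting (default 10% of the heap) and its counters show up in node stats; a minimal sketch for checking it (localhost is a placeholder):

# Current and rejected indexing bytes per node (7.9+)
curl -s 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.indexing_pressure&pretty'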

Cool! I will try a higher version and will use this flag if I see this error again.

That is true; my mistake. I double-checked my notes: it was actually a minor bump to Lucene 8.6.0, associated with a memory leak in v7.9.0. I suppose that has now been fixed and more broadly tested in v7.9.1+.