Elasticsearch 7.7.1 shards getting unassigned

We recently upgraded our Elasticsearch cluster from 5.6.16 to 7.7.1.

Since then, I am sometimes observing that a few shards are not getting assigned.

My node stats are posted here.

The allocation explanation for one of the unassigned shards is shown below:

ubuntu@platform2:~$      curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
> {
>   "index": "denorm",
>   "shard": 14,
>   "primary": false
> }
> '
{
  "index" : "denorm",
  "shard" : 14,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-11-19T13:09:42.072Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "0_00hk5IRcmgrHGYjpV1jA",
      "node_name" : "platform2",
      "transport_address" : "10.62.70.178:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "9ltF-KXGRk-xMF_Ef1DAng",
      "node_name" : "platform3",
      "transport_address" : "10.62.70.179:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[denorm][14], node[9ltF-KXGRk-xMF_Ef1DAng], [P], s[STARTED], a[id=SNyCoFUzSwaiIE4187Tfig]]"
        }
      ]
    },
    {
      "node_id" : "ocKks7zJT7OODhse-yveyg",
      "node_name" : "platform1",
      "transport_address" : "10.62.70.177:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}
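From the max_retry decider output above, I understand that once the heap pressure eases the allocation can be retried manually with the endpoint it names (assuming the same local node):

curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"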

As mentioned here, the node stats show that out of the 4.3 GB heap only about 85 MB is being used for keeping in-memory data structures.

As discussed here, after setting indices.breaker.total.use_real_memory: false I no longer see the Data too large exception.
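As far as I understand, indices.breaker.total.use_real_memory is a static setting, so it goes into elasticsearch.yml on every node and needs a node restart, roughly like this:

# elasticsearch.yml -- static setting, requires a node restart to take effect
indices.breaker.total.use_real_memory: false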

Can someone let me know how I can confirm whether I am observing the same issue as discussed here?

The shard allocation output is posted here.
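For reference, a similar per-node allocation summary can be pulled with the cat allocation API:

curl -X GET "localhost:9200/_cat/allocation?v"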

Elasticsearch thinks it's using 4.1GB of heap; why do you think it's only using ~85MB?

That looks like far too many shards for such small indices.
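A quick way to see the shard count against the data volume per index is the cat indices API, for example:

curl -X GET "localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc"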


Thanks @DavidTurner for replying. I misread the stats.

This is what I am seeing in the stats (posted in my question):

"jvm": {
        "timestamp": 1605796569391,
        "uptime_in_millis": 9016168,
        "mem": {
          "heap_used_in_bytes": 3468059000,
          "heap_used_percent": 73,
          "heap_committed_in_bytes": 4712169472,
          "heap_max_in_bytes": 4712169472,
          "non_heap_used_in_bytes": 167725288,
          "non_heap_committed_in_bytes": 180129792,
          "pools": {
            "young": {
              "used_in_bytes": 900349776,
              "max_in_bytes": 907345920,
              "peak_used_in_bytes": 907345920,
              "peak_max_in_bytes": 907345920
            },
            "survivor": {
              "used_in_bytes": 112164136,
              "max_in_bytes": 113377280,
              "peak_used_in_bytes": 113377280,
              "peak_max_in_bytes": 113377280
            },
            "old": {
              "used_in_bytes": 2455545088,
              "max_in_bytes": 3691446272,
              "peak_used_in_bytes": 3691444384,
              "peak_max_in_bytes": 3691446272
            }
          }
        },

Just for my understanding, why is it showing real usage: [4479322256/4.1gb] in the circuit breaker log?

Just to add: the same configuration with the same load works fine with Elasticsearch 5.6.16; we are seeing this issue only with our Elasticsearch 7.7.1 clusters.

That's the heap usage at the time of the failure.
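For comparison, the breakers' current readings can be pulled from the node stats API, e.g.:

curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"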

You might do better with 7.9, which has the benefit of https://github.com/elastic/elasticsearch/pull/58674. Apart from that my best suggestion is to reduce your shard count.
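If you reduce the shard count of existing indices, the shrink API is one option. This is a rough sketch only, with a placeholder target name and node choice, assuming the source has 15 primaries (the target primary count must be a factor of the source's, and the source must first be made read-only with a copy of every shard on one node):

curl -X PUT "localhost:9200/denorm/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": "platform1",
    "index.blocks.write": true
  }
}
'
curl -X POST "localhost:9200/denorm/_shrink/denorm-shrunk" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null,
    "index.number_of_shards": 3
  }
}
'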

@DavidTurner One last query: we are using the default ConcMarkSweepGC. Will https://github.com/elastic/elasticsearch/pull/58674 also help with ConcMarkSweepGC, or is it specific to G1GC?

Good point, the focus of that PR is indeed G1GC.
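One way to confirm which collector each node is actually running is the node info API, e.g.:

curl -X GET "localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors&pretty"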
