Unassigned Shards with error 'Failed to create index' - NullPointerException

I encountered a problem bringing an Elasticsearch 7.7.0 cluster (with the Open Distro security plugin installed) back from a full cluster restart. The cluster has 3 master, 3 coordinating, 2 hot, 2 warm, and 2 cold nodes.

All nodes appear to have rejoined the cluster successfully:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.2.3.197            8          99   1    0.12    0.22     0.22 dilt      -      eshost1551.example.com
10.2.3.202           67          95  27    1.24    0.95     0.98 ilm       *      eshost1570.example.com
10.2.3.200            3          84   1    0.09    0.08     0.12 dilt      -      eshost1550.example.com
10.2.3.201           13          97   2    0.18    0.15     0.14 ilm       -      eshost1569.example.com
10.2.3.205            9          97   2    0.27    0.15     0.12 l         -      eshost1573.example.com
10.2.3.196            1          63   1    0.13    0.15     0.27 dilt      -      eshost1554.example.com
10.2.3.206            8          97   1    0.04    0.04     0.06 l         -      eshost1574.example.com
10.2.3.199            3          96   2    0.20    0.14     0.14 dilt      -      eshost1549.example.com
10.2.3.204            8          75   2    0.02    0.08     0.12 l         -      eshost1572.example.com
10.2.3.203            9          94   2    0.11    0.11     0.13 ilm       -      eshost1571.example.com
10.2.3.195           63          94   3    0.20    0.18     0.24 dilt      -      eshost1553.example.com
10.2.3.198            2          95   2    0.20    0.23     0.22 dilt      -      eshost1552.example.com
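
(For reference, the node listing above is from the _cat/nodes API; roughly this call, with localhost:9200 standing in for our real endpoint:)

curl -s 'http://localhost:9200/_cat/nodes?v&h=ip,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node.role,master,name'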

But roughly 2,000 of the ~2,400 shards are unassigned; only one hot node (eshost1553) has any shards allocated to it. Calling the "/_cluster/reroute?retry_failed=true" endpoint forces another allocation attempt, but every shard just returns to UNASSIGNED. The "/_cluster/allocation/explain" endpoint reports that index creation failed with a NullPointerException:

{
  "index": "security-auditlog-2022.01.30",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "MANUAL_ALLOCATION",
    "at": "2022-07-01T17:09:15.230Z",
    "details": "failed shard on node [ZZFUy43-So6NBdyhM_dPBw]: failed to create index, failure NullPointerException[null]",
    "last_allocation_status": "awaiting_info"
  },
  "can_allocate": "yes",
  "allocate_explanation": "can allocate the shard",
  "target_node": {
    "id": "jhzbGKS8TQePfepXOb_0kw",
    "name": "eshost1549.example.com",
    "transport_address": "10.2.3.199:9300",
    "attributes": {
      "ml.machine_memory": "67368648704",
      "ml.max_open_jobs": "20",
      "xpack.installed": "true",
      "box_type": "cold",
      "transform.node": "true"
    }
  },
  "allocation_id": "pi2twkWMSly5j9K6-GaGDQ",
  "node_allocation_decisions": [
    {
      "node_id": "ZZFUy43-So6NBdyhM_dPBw",
      "node_name": "eshost1550.example.com",
      "transport_address": "10.2.3.200:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "JVRoSHUcTRiuXg_YvcKilw"
      }
    },
    {
      "node_id": "jhzbGKS8TQePfepXOb_0kw",
      "node_name": "eshost1549.example.com",
      "transport_address": "10.2.3.199:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "pi2twkWMSly5j9K6-GaGDQ"
      }
    },
    {
      "node_id": "Jw9QJ3S5Tvyd36-iT8wprw",
      "node_name": "eshost1552.example.com",
      "transport_address": "10.2.3.198:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "t0jOfIYASESl8rQ66RPUyg",
      "node_name": "eshost1551.example.com",
      "transport_address": "10.2.3.197:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "z5llgKCRQ16M1rG0QhiHew",
      "node_name": "eshost1554.example.com",
      "transport_address": "10.2.3.196:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "zN0fjtZhTHiNkxtW5f3zqg",
      "node_name": "eshost1553.example.com",
      "transport_address": "10.2.3.195:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
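
(For completeness, these are roughly the calls behind the two endpoints mentioned above; host and port are placeholders:)

# Retry shards whose allocation previously failed
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'

# Ask the cluster why this specific shard is unassigned
curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "index": "security-auditlog-2022.01.30", "shard": 0, "primary": true }'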

Debug logging shows the same failure, with a stack trace pointing into the Open Distro security plugin:

[2022-07-01T11:38:39,888][DEBUG][o.e.i.IndicesService     ] [eshost.example.com] creating Index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]], shards [1]/[1] - reason [CREATE_INDEX]
[2022-07-01T11:38:39,893][DEBUG][o.e.g.PersistedClusterStateService] [eshost.example.com] writing cluster state took [0ms]; wrote global metadata [false] and metadata for [16] indices and skipped [760] unchanged indices
[2022-07-01T11:38:39,894][DEBUG][c.a.o.s.OpenDistroSecurityPlugin] [eshost.example.com] Handle complianceConfig=null/dlsFlsAvailable: true/auditLog=class com.amazon.opendistroforelasticsearch.security.auditlog.NullAuditLog for onIndexModule() of index security-auditlog-2020.11.26
[2022-07-01T11:38:39,894][WARN ][o.e.i.c.IndicesClusterStateService] [eshost.example.com] [security-auditlog-2020.11.26][0] marking and sending shard failed due to [failed to create index]
java.lang.NullPointerException: null
        at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin.onIndexModule(OpenDistroSecurityPlugin.java:519) ~[?:?]
        at org.elasticsearch.plugins.PluginsService.onIndexModule(PluginsService.java:267) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:607) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:549) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:176) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:484) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:246) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_251]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_251]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_251]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_251]
[2022-07-01T11:38:39,895][DEBUG][o.e.c.a.s.ShardStateAction] [eshost.example.com] sending [internal:cluster/shard/failure] to [YpOcM6LtTIKVV3swKpZAWA] for shard entry [shard id [[security-auditlog-2020.11.26][0]], allocation id [n1zocy8ZTFqgFckuyVd9lg], primary term [0], message [failed to create index], failure [NullPointerException[null]], markAsStale [true]]

Trace logging shows:

-- index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]]
----shard_id [security-auditlog-2020.11.26][0]
--------[security-auditlog-2020.11.26][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=MANUAL_ALLOCATION], at[2022-07-01T17:05:16.424Z], delayed=false, details[failed shard on node [z5llgKCRQ16M1rG0QhiHew]: failed to create index, failure NullPointerException[null]], allocation_status[deciders_throttled]]
--------[security-auditlog-2020.11.26][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-07-01T04:17:17.657Z], delayed=false, allocation_status[no_attempt]]
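
(The debug and trace output above was captured by raising logger levels through the cluster settings API; the exact logger names below are a best guess based on the log categories shown:)

# Raise logger verbosity (transient, so it resets on a full restart)
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
        "transient": {
          "logger.org.elasticsearch.indices": "DEBUG",
          "logger.org.elasticsearch.cluster.service": "TRACE"
        }
      }'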

What can we do to understand what is causing the shards to remain unassigned?

The NullPointerException is coming from the Open Distro security plugin (OpenDistroSecurityPlugin.onIndexModule in your stack trace), which is not supported here. I would recommend opening an issue on the OpenDistro/OpenSearch forum.


OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

Thanks for the pointer; we'll dig into the OpenDistro plugin and see what we can find.
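
(As a first step, we'll likely confirm which plugin versions are installed on each node; something like this, again with host and port as placeholders:)

curl -s 'http://localhost:9200/_cat/plugins?v&h=name,component,version'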
