Failed to save to index: maximum shards open limit reached

Hello,

I am running a web application, Automation Anywhere A360, on my own Windows Server machine. This application uses a local Elasticsearch instance to store its audit logs.

The cluster health endpoint (GET _cluster/health) shows the following:

{
    "cluster_name": "aa_cr_elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 493,
    "active_shards": 493,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 499,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 49.69758064516129
}

The cluster allocation explain endpoint (GET _cluster/allocation/explain) shows the following for one of the unassigned shards:

{
    "index": "bilegacyutility",
    "shard": 2,
    "primary": false,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "CLUSTER_RECOVERED",
        "at": "2023-01-11T19:29:57.256Z",
        "last_allocation_status": "no_attempt"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions": [
        {
            "node_id": "Y2wDy49CSgqkleEfKeQShQ",
            "node_name": "localhost",
            "transport_address": "127.0.0.1:47600",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node [[bilegacyutility][2], node[Y2wDy49CSgqkleEfKeQShQ], [P], s[STARTED], a[id=xP-_uMOwSfCxSIVVSIW9vQ]]"
                }
            ]
        }
    ]
}

PROBLEM: Recently some audit logs did not appear in the app, and according to the logs the cause is shard-related:

2023-Jan-09 Mon 15:44:55.539 **ERROR - com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher - {} - run(DurableMessageTransactionalPublisher.java:460) - Error: com.automationanywhere.es_client.ESRestClientException: Failed to save to index: audit_logs_20230101**
    at com.automationanywhere.es_client.ESRestClient.insertJsonDoc(ESRestClient.java:706) ~[kernel.jar:?]
    at com.automationanywhere.es_client.ESRestClient.insertJsonDoc(ESRestClient.java:765) ~[kernel.jar:?]
    at com.automationanywhere.es_client.ESRestClient.insertJsonDoc(ESRestClient.java:757) ~[kernel.jar:?]
    at com.automationanywhere.audit.model.AuditESPublisher$BatchPublisher.publish(AuditESPublisher.java:36) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTopicPublisher$BatchPublisher.publish(DurableMessageTopicPublisher.java:19) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher.lambda$processTopicMessage$1(DurableMessageTransactionalPublisher.java:677) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessagingBase.lambda$runWithContext$0(DurableMessagingBase.java:64) ~[kernel.jar:?]
    at com.automationanywhere.common.security.context.SecurityContextHelper.runAsUser(SecurityContextHelper.java:253) ~[kernel.jar:?]
    at com.automationanywhere.common.security.context.SecurityContextHelper.runAsUser(SecurityContextHelper.java:238) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessagingBase.runWithContext(DurableMessagingBase.java:78) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher.processTopicMessage(DurableMessageTransactionalPublisher.java:671) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher.waitAndProcessMessage(DurableMessageTransactionalPublisher.java:587) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher.access$400(DurableMessageTransactionalPublisher.java:116) ~[kernel.jar:?]
    at com.automationanywhere.durablemessaging.DurableMessageTransactionalPublisher$2.run(DurableMessageTransactionalPublisher.java:425) [kernel.jar:?]
**Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=validation_exception, reason=Validation Failed: 1: this action would add [10] total shards, but this cluster currently has [992]/[1000] maximum shards open;]**
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1911) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1888) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1645) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1602) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1572) ~[kernel.jar:?]
    at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:989) ~[kernel.jar:?]
    at com.automationanywhere.es_client.ESRestClient.insertJsonDoc(ESRestClient.java:700) ~[kernel.jar:?]

For what it's worth, the drive where all of this is stored has 234 GB free.

We know we can raise the shard limit above 1,000, but we have not done so, since it is not recommended. We would like a more sustainable mid- to long-term solution for this, thank you!
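For anyone investigating the same symptom: assuming you have read access to the _cat APIs, this request shows how many primary (pri) and replica (rep) shards each index is configured with, which helps pinpoint where the shard count is coming from. Note that the error above says the new monthly index alone "would add [10] total shards".

GET _cat/indices?v&h=index,pri,rep&s=pri:desc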

Given that you have a single node, you don't need replicas: a replica can never be allocated to the same node as its primary (that is exactly what the same_shard decider above is telling you), so your 499 replica shards will stay unassigned forever while still counting toward the 1,000-shard limit (493 + 499 = 992 open shards, matching the error). I would set everything to 0 replicas; that will help in the short term.
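To keep future indices from being created with replicas as well, an index template can set the default. A minimal sketch (the template name and the catch-all pattern below are illustrative, using the legacy _template API available in Elasticsearch 7.x):

PUT _template/zero-replicas

{
    "index_patterns": ["*"],
    "order": 0,
    "settings": {
        "index": {
            "number_of_replicas": 0
        }
    }
}

New indices matching the pattern would then start with 0 replicas, so only their primary shards count toward the limit.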

Is that safe to do? (Safer than raising the cluster's shard limit above 1,000?)

You have a single node, so you are already at risk of data loss: none of your replicas are assigned in the first place.

And how can I set everything to 0 replicas? Whenever I make this request:

PUT /*/_settings

{
    "index": {
        "number_of_replicas": 0
    }
}

The response is the following (HTTP status 403 Forbidden):

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "no permissions for [] and User [name=es_client, backend_roles=[], requestedTenant=null]"
            }
        ],
        "type": "security_exception",
        "reason": "no permissions for [] and User [name=es_client, backend_roles=[], requestedTenant=null]"
    },
    "status": 403
}

How can I set replicas to 0?
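For future readers: the 403 comes from the Elasticsearch security layer, not from Elasticsearch itself; the es_client user the application uses has no permission to change index settings. Assuming you can authenticate as an administrative user (the username, host, and port below are placeholders for your own setup), the same request sent with those credentials should succeed:

curl -k -u admin -X PUT "https://<es-host>:<port>/*/_settings" -H "Content-Type: application/json" -d "{\"index\": {\"number_of_replicas\": 0}}"

curl will prompt for the password; -k skips certificate verification for a self-signed local instance.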

Thanks in advance

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.