Shards Not Being Allocated To Nodes

Hope to get assistance here as I have been struggling.

I have 11 nodes running Elasticsearch, with 2 master nodes, and 5 shards with 1 replica (the defaults).

We recently ran patches on the nodes and upgraded the .NET software. Since then, only 2 shards are allocated (to 2 of the nodes) and the other 9 nodes do not have any shards allocated to them at all. I have run _cluster/reroute?retry_failed=true using Postman, but this has not helped.
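
For reference, the call I made from Postman is roughly equivalent to this (a sketch only; the host and port are assumptions, ours differ):

```python
import requests

# Retry shard allocations that previously hit the maximum number of failed
# allocation attempts. Same API call as the Postman request described above.
# The host and port are placeholders.
resp = requests.post("http://localhost:9200/_cluster/reroute?retry_failed=true")
print(resp.status_code)
print(resp.json().get("acknowledged"))
```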

We see the statuses below:
allocation status: no valid shard copy
allocation status: no attempt
allocate explanation: cannot allocate because all found copies of the shard are either stale or corrupt

The above is preventing my Indexer IIS application from writing any new documents to the data directories.
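
For completeness, the statuses above are from the allocation explain output; the kind of query that produces them looks roughly like this (a sketch; the host, index name, and shard number are placeholders):

```python
import requests

# Ask the cluster why a particular shard copy is unassigned.
# "my-index" and shard 0 are placeholders, not our real index.
body = {"index": "my-index", "shard": 0, "primary": True}
resp = requests.get("http://localhost:9200/_cluster/allocation/explain", json=body)
explain = resp.json()
print(explain.get("allocate_explanation"))
print(explain.get("unassigned_info"))
```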

I am running version 5.6 of Elasticsearch.

Welcome to our community! :smiley:

5.X is extremely old and well past EOL. Please upgrade as a matter of urgency!

What is the output from the _cluster/stats?pretty&human API?
What do your Elasticsearch logs show?

Thanks Mark.

Output:

    "_nodes": {
        "total": 11,
        "successful": 11,
        "failed": 0
    },
    "cluster_name": "elasticsearch",
    "timestamp": 1634114914482,
    "status": "red",
    "indices": {
        "count": 1,
        "shards": {
            "total": 2,
            "primaries": 1,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 2,
                    "avg": 2.0
                },
                "primaries": {
                    "min": 1,
                    "max": 1,
                    "avg": 1.0
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 300,
            "deleted": 140
        },
        "store": {
            "size": "27.6gb",
            "size_in_bytes": 29647417194,
            "throttle_time": "0s",
            "throttle_time_in_millis": 0
        },
        "fielddata": {
            "memory_size": "0b",
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "67.2kb",
            "memory_size_in_bytes": 68904,
            "total_count": 4765190,
            "hit_count": 252225,
            "miss_count": 4512965,
            "cache_size": 0,
            "cache_count": 2885,
            "evictions": 2885
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 5,
            "memory": "65.8kb",
            "memory_in_bytes": 67440,
            "terms_memory": "21.5kb",
            "terms_memory_in_bytes": 22087,
            "stored_fields_memory": "1.7kb",
            "stored_fields_memory_in_bytes": 1776,
            "term_vectors_memory": "1.5kb",
            "term_vectors_memory_in_bytes": 1632,
            "norms_memory": "1.2kb",
            "norms_memory_in_bytes": 1280,
            "points_memory": "197b",
            "points_memory_in_bytes": 197,
            "doc_values_memory": "39.5kb",
            "doc_values_memory_in_bytes": 40468,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "352b",
            "fixed_bit_set_memory_in_bytes": 352,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 11,
            "data": 11,
            "coordinating_only": 0,
            "master": 2,
            "ingest": 0
        },
        "versions": [
            "5.6.16"
        ],
        "os": {
            "available_processors": 164,
            "allocated_processors": 164,
            "names": [
                {
                    "name": "Windows Server 2012 R2",
                    "count": 11
                }
            ],
            "mem": {
                "total": "139.9gb",
                "total_in_bytes": 150317256704,
                "free": "32.3gb",
                "free_in_bytes": 34764726272,
                "used": "107.6gb",
                "used_in_bytes": 115552530432,
                "free_percent": 23,
                "used_percent": 77
            }
        },
        "process": {
            "cpu": {
                "percent": 2
            },
            "open_file_descriptors": {
                "min": -1,
                "max": -1,
                "avg": 0
            }
        },
        "jvm": {
            "max_uptime": "1.8d",
            "max_uptime_in_millis": 159343516,
            "versions": [
                {
                    "version": "1.8.0_301",
                    "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
                    "vm_version": "25.301-b09",
                    "vm_vendor": "Oracle Corporation",
                    "count": 11
                }
            ],
            "mem": {
                "heap_used": "15.4gb",
                "heap_used_in_bytes": 16632632416,
                "heap_max": "62.9gb",
                "heap_max_in_bytes": 67550838784
            },
            "threads": 1575
        },
        "fs": {
            "total": "4.2tb",
            "total_in_bytes": 4724419940352,
            "free": "3.5tb",
            "free_in_bytes": 3908139868160,
            "available": "3.5tb",
            "available_in_bytes": 3908139868160
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "netty4": 11
            },
            "http_types": {
                "netty4": 11
            }
        }
    }
}


**The ES logs showed the below yesterday, whereas today they say the master nodes are being detected.**

[2021-10-12T21:58:50,300][WARN ][r.suppressed             ] path: /default_sm_index_%2A/interactiondata/_search, params: {index=default_sm_index_*, type=interactiondata}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed


**My indexer logs show the below when making a bulk update:**

```
2021-10-13 10:51:26.686 [WRN] Tenant "default": failed to update index.
2021-10-13 10:52:39.283 [WRN] Tenant "default": Commit Bulk failed.
System.Exception: Invalid NEST response built from a unsuccessful low level call on POST: /_bulk
# Invalid Bulk items:
# Audit trail of this API call:
 - [1] BadRequest: Node: http://10.102.246.125:9200/ Took: 00:01:00.0833281
 - [2] MaxTimeoutReached: Took: -738075.08:52:39.2826415
# OriginalException: System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request..
 ---> System.Net.Sockets.SocketException (995): The I/O operation has been aborted because of either a thread exit or an application request.
```

Please also note that I have 1,280,870 documents waiting to be indexed.

The first suggestion would be to use a newer version of Elasticsearch.

Thanks Mark. Will test the compatibility between my indexer and ES 7.9 in the dev environment first.

Is there a workaround for this before upgrading, as this is a production issue?

As Mark pointed out, this is very, very old and you should look to upgrade.

Having 2 master-eligible nodes is very bad. You should always look to have 3 master-eligible nodes in a cluster, as Elasticsearch relies on consensus to elect a master. As you are running such an old version of Elasticsearch, you must also make sure you have discovery.zen.minimum_master_nodes defined and set to 2 in your node config. This will prevent your cluster from suffering split-brain scenarios and the data loss these can cause. If you do not currently have this set correctly, it is possible that your data has been lost.
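
If it helps, this setting can also be applied without a restart through the cluster settings API. A minimal sketch, assuming a placeholder host (the static equivalent is the line discovery.zen.minimum_master_nodes: 2 in elasticsearch.yml on every master-eligible node):

```python
import requests

# Persistently set minimum_master_nodes to 2 via the cluster settings API.
# The host is a placeholder; run this against any node in the cluster.
settings = {"persistent": {"discovery.zen.minimum_master_nodes": 2}}
resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
print(resp.json())
```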

You should also always back up your data using the snapshot and restore API, so that you do not lose it if you suffer a catastrophic failure.
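
A minimal sketch of taking a snapshot on 5.x, assuming a placeholder host, repository name, and location (the location must be a shared filesystem, for example a UNC share on Windows, that every node can reach and that is listed under path.repo in elasticsearch.yml):

```python
import requests

ES = "http://localhost:9200"  # placeholder host

# Register a shared-filesystem snapshot repository.
# "my_backup" and the location are placeholders.
repo = {"type": "fs", "settings": {"location": "/mount/es_backups"}}
requests.put(ES + "/_snapshot/my_backup", json=repo).raise_for_status()

# Snapshot all indices and wait for the snapshot to complete.
resp = requests.put(ES + "/_snapshot/my_backup/snapshot_1?wait_for_completion=true")
print(resp.json())
```

A snapshot taken this way can later be restored with the _restore endpoint on the same repository.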
