Issue: Nodes Getting Deleted Suddenly in Elasticsearch

I am experiencing an issue where nodes in our Elasticsearch cluster are getting deleted suddenly. This issue is causing shard allocation failures and recovery problems. Below are the anonymized logs related to the issue:

Logs

{
  "index" : "sharedb",
  "node_allocation_decisions" : [
    {
      "node_name" : "node1",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-10-30T11:06:31.920Z], failed_attempts[5], failed_nodes[[node2]], delayed=false, last_node[node2], details[failed shard on node [node2]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [sharedb][0]: Recovery failed on {node1}{node2}{info1}{node1}{ip}{ip:9300}{cdfhilmrstw}{8.13.2}{7000099-8503000}{ml.machine_memory=23622320128, ml.allocated_processors=32, ml.allocated_processors_double=32.0, ml.max_jvm_size=11811160064, ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$36(IndexShard.java:3313)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$9(StoreRecovery.java:394)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$internalRecoverFromStore$12(StoreRecovery.java:498)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:317)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:378)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:290)\n\tat 
org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:189)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:165)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:494)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:94)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2449)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: [sharedb/id1][[sharedb][0]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway\n\t... 21 more\nCaused by: [sharedb/id1][[sharedb][0]] org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:274)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:224)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:14)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.createEngine(IndexShard.java:2125)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2098)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2059)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$internalRecoverFromStore$10(StoreRecovery.java:485)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:366)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:286)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:189)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.lambda$andThen$0(SubscribableListener.java:437)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener.run(ActionListener.java:356)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.newForked(SubscribableListener.java:128)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.andThen(SubscribableListener.java:437)\n\tat 
org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.andThen(SubscribableListener.java:411)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:417)\n\t... 8 more\nCaused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /usr/share/elasticsearch/data/indices/id1/0/index/write.lock\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:953)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:2662)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:2650)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:267)\n\t... 24 more\n], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
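(The output above is in the format returned by the cluster allocation explain API; a request along the lines below should reproduce it — the index name and shard number come from the log, while the primary flag is an assumption.)

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"index": "sharedb", "shard": 0, "primary": true}'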

Could you please provide guidance on how to resolve this issue? Any insights or suggestions would be greatly appreciated. Thank you!

Hello and welcome,

You need to provide more context: what indicates that a node was deleted? How are you running Elasticsearch? Directly on VMs? Docker? ECK?

The log you shared indicates that Elasticsearch was not able to allocate the shard for this index on the node named node2, and if you check the full error log you will see something like this:

Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /usr/share/elasticsearch/data/indices/id1/0/index/write.lock

This suggests that another application is interfering with Elasticsearch's data files; you need to check for that on the node.
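One way to check that, sketched below under the assumption that you have shell access to the Elasticsearch data path and that lsof is available (the lock path is taken from the error above, with the anonymized index ID left as-is):

# Does the lock file exist, and when was it last touched?
ls -l /usr/share/elasticsearch/data/indices/id1/0/index/write.lock
# Which process, if any, currently has the lock file open?
lsof /usr/share/elasticsearch/data/indices/id1/0/index/write.lock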

There's a small error on my part: it's not that nodes got deleted, but that primary shards become unassigned or the data in some indices goes missing.

I am using Elasticsearch v8.13.2 with Docker as a single node. I have been facing loss of docs in random indices. Earlier this used to get resolved automatically by increasing the RAM size, but at the current stage that is no longer helping.

I am using this command to run the Docker container:

docker run -d --restart unless-stopped \
  --name elasticsearch \
  --net elastic \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.security.audit.enabled=true" \
  -e "xpack.security.audit.logfile.events.emit_request_body=true" \
  -e "logger.org.elasticsearch.transport=TRACE" \
  -m 2GB \
  -v elasticsearch-data:/usr/share/elasticsearch/data \
  -v /home/elasticsearch/backup:/home/elasticsearch/backup \
  -v /var/log/elasticsearch:/usr/share/elasticsearch/logs \
  --log-opt max-size=200m \
  --log-opt max-file=5 \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.2

docker logs shows this type of exception:

{"@timestamp":"2024-10-23T13:10:45.030Z", "log.level": "WARN", "message":"Received response for a request that has timed out, sent [1.1m/69846ms] ago, timed out [54.8s/54836ms] ago, action [indices:monitor/stats[n]], node [{*****}{kn5WbDj1SNmnnzaEdYtAQA}{*****}{***}{172.18.0.2}{172.18.0.2:9300}{***}{8.13.2}{7000099-8503000}{ml.machine_memory=23622320128, ml.allocated_processors=32, ml.allocated_processors_double=32.0, ml.max_jvm_size=11811160064, ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}], id [105027]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[e6f71666ef1c][management][T#4]","log.logger":"org.elasticsearch.transport.TransportService","elasticsearch.cluster.uuid":"rFFNxXCdSuWNql4Ue33KCQ","elasticsearch.node.id":"kn5WbDj1SNmnnzaEdYtAQA","elasticsearch.node.name":"e6f71666ef1c","elasticsearch.cluster.name":"docker-cluster"}

Something is not right.

In your logs there are multiple nodes mentioned, like node1 and node2, but you shared a docker command where you start a single-node cluster.

So this is confusing: how exactly are you running it? Your docker command does not match the log you shared.

Also, avoid redacting the node names in logs, as this makes everything even more confusing. It is not clear how many nodes you have now, and you need to indicate which node each log comes from; redacting the node name without replacing it with some kind of identifier does not help.


So, there is only a single node set up in the Docker container, yet in the logs it is trying to assign the shard to [node 2], which does not exist. I had masked some of the names in the logs in my original post, I apologize; here are the full, unaltered logs:

{
  "index" : "sharedb",
  "node_allocation_decisions" : [
    {
      "node_name" : "e6f71666ef1c",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-10-30T11:06:31.920Z], failed_attempts[5], failed_nodes[[kn5WbDj1SNmnnzaEdYtAQA]], delayed=false, last_node[kn5WbDj1SNmnnzaEdYtAQA], details[failed shard on node [kn5WbDj1SNmnnzaEdYtAQA]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [sharedb][0]: Recovery failed on {e6f71666ef1c}{kn5WbDj1SNmnnzaEdYtAQA}{drcq8V_YQqmKLqImcGyrIA}{e6f71666ef1c}{172.18.0.2}{172.18.0.2:9300}{cdfhilmrstw}{8.13.2}{7000099-8503000}{ml.machine_memory=23622320128, ml.allocated_processors=32, ml.allocated_processors_double=32.0, ml.max_jvm_size=11811160064, ml.config_version=12.0.0, xpack.installed=true, transform.config_version=10.0.0}\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$36(IndexShard.java:3313)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$9(StoreRecovery.java:394)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$internalRecoverFromStore$12(StoreRecovery.java:498)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:317)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:378)\n\tat 
org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:290)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:189)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:165)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:494)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:94)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2449)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:95)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: [sharedb/uYcf1bOUSgSEWY380No1JA][[sharedb][0]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway\n\t... 21 more\nCaused by: [sharedb/uYcf1bOUSgSEWY380No1JA][[sharedb][0]] org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:274)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:224)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:14)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.createEngine(IndexShard.java:2125)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2098)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2059)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.lambda$internalRecoverFromStore$10(StoreRecovery.java:485)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:366)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:286)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.addListener(SubscribableListener.java:189)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.lambda$andThen$0(SubscribableListener.java:437)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.ActionListener.run(ActionListener.java:356)\n\tat 
org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.newForked(SubscribableListener.java:128)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.andThen(SubscribableListener.java:437)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.action.support.SubscribableListener.andThen(SubscribableListener.java:411)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:417)\n\t... 8 more\nCaused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /usr/share/elasticsearch/data/indices/uYcf1bOUSgSEWY380No1JA/0/index/write.lock\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:106)\n\tat org.apache.lucene.core@9.10.0/org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:953)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:2662)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.createWriter(InternalEngine.java:2650)\n\tat org.elasticsearch.server@8.13.2/org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:267)\n\t... 24 more\n], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
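For what it's worth, the explanation field in this log already names the follow-up step: once whatever was holding the write.lock is gone, the failed allocation can be retried with the reroute call it mentions, for example via curl against the port mapped in the docker command above:

curl -X POST "localhost:9200/_cluster/reroute?retry_failed&metric=none&pretty"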