Elastic unable to start after turning it off

pabloochoa · September 2, 2022, 9:11am

Hello,
I have a cluster of 3 ES nodes running 7.16.3. All of them are both master-eligible and data nodes (I am trying to deploy a minimal ES cluster that survives the failure of one node). Maybe this is not a good configuration, so I would appreciate any suggestions.
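
As a rough check of the fault tolerance described above: Elasticsearch 7.x needs a majority of the master-eligible nodes to elect a master, so the arithmetic can be sketched as follows. The function name is made up for illustration, and the sketch only covers master election; whether the data survives a node failure also depends on every index having at least one replica.

```python
def tolerated_master_failures(master_eligible_nodes: int) -> int:
    """Number of master-eligible nodes that can fail while a majority remains."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    # A majority is strictly more than half of the master-eligible nodes.
    majority = master_eligible_nodes // 2 + 1
    return master_eligible_nodes - majority


for n in (1, 2, 3, 5):
    print(n, "node(s) ->", tolerated_master_failures(n), "tolerated failure(s)")
```

With three master-eligible nodes the majority is two, so one node can be lost, which matches the goal stated in this post.
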
The problem is that after stopping the containers to increase the minimum and maximum heap size (it was only 2GB) and starting them again, I kept getting the following errors:

{"type": "server", "timestamp": "2022-09-02T09:51:22,549+02:00", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "es-cluster", "node.name": "master1", "message": "[.geoip_databases][0] marking and sending shard failed due to [failed recovery]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ" ,
 "stacktrace": ["org.elasticsearch.indices.recovery.RecoveryFailedException: [.geoip_databases][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}",
 "at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3234) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:391) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:439) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:86) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2349) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]",
 "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
 "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
 "at java.lang.Thread.run(Thread.java:833) [?:?]",
 "Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery",
 "... 11 more",
 "Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog] is corrupted",
 "at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1891) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "... 8 more",
 "Caused by: java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog/translog-64.tlog",
 "at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
 "at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
 "at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
 "at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]",
 "at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]",
 "at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]",
 "at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1886) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "... 8 more"] }

{"type": "server", "timestamp": "2022-09-02T09:51:27,153+02:00", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "es-cluster", "node.name": "master1", "message": "failing shard [failed shard, shard [.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0], node[dgg-QOvASYaBzaKkaaRRzQ], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=wkCEdz7SSYWhTeecmuArlw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-09-02T07:51:18.157Z], delayed=false, allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog] is corrupted]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-6.tlog]; ], markAsStale [true]]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ"

As the last error message shows, the problem is that ES is looking for translog files with other numbers (6, 9, ...) than the ones that actually exist. For example, for the index ulldHaPTS9WiZPZS7ItVWA the translog file that exists is /usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-2.tlog.

This is the first time it has happened, so I don't know what I am supposed to do. Data loss is not critical, although I would prefer a solution that doesn't involve losing data.
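
To see the mismatch described above at a glance, the missing file names can be pulled out of the log and compared with what is actually on disk. This is a minimal Python sketch, not an Elasticsearch tool: the function names and the regular expression are made up for illustration, and it only assumes the log wording shown in this post (`NoSuchFileException` followed by the path).

```python
import re
from pathlib import Path
from typing import List

# Matches both forms seen in the logs above:
#   NoSuchFileException[/path/translog-6.tlog]
#   NoSuchFileException: /path/translog-64.tlog
_MISSING = re.compile(r'NoSuchFileException(?:\[|: )(?P<path>/[^\]"\s]+\.tlog)')
_TLOG_NAME = re.compile(r"translog-(\d+)\.tlog")


def missing_translog_files(log_text: str) -> List[str]:
    """Return the translog paths reported as missing, in order, without duplicates."""
    found: List[str] = []
    for match in _MISSING.finditer(log_text):
        path = match.group("path")
        if path not in found:
            found.append(path)
    return found


def existing_translog_files(translog_dir: str) -> List[str]:
    """List the translog-N.tlog files present in a shard's translog directory, sorted by N."""
    names = [p.name for p in Path(translog_dir).iterdir() if _TLOG_NAME.fullmatch(p.name)]
    return sorted(names, key=lambda name: int(_TLOG_NAME.fullmatch(name).group(1)))
```

Running `missing_translog_files` over the node log and `existing_translog_files` over each shard's `translog` directory shows, per shard, which file Elasticsearch expects and which files are really there.
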

pabloochoa · September 2, 2022, 9:20am

I am currently using ES as the index store for LinkedIn DataHub. I have noticed that DataHub internally uses frozen indices (although they are deprecated). Could that be related to the cause of these errors?

warkolm · September 6, 2022, 2:01am

.geoip_databases is the index it's failing on. You can delete it and restart Elasticsearch, and it'll redownload what it needs to recreate it. Otherwise you can disable the GeoIP downloader entirely.

A deprecated feature won't cause this sort of error, however. It's not clear what did, but sharing more logs might help.
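
For the "disable it entirely" option, the relevant setting is `ingest.geoip.downloader.enabled`, which can be changed through the cluster settings API. Below is a minimal Python sketch that only builds the request. The helper name is hypothetical, and the base URL `http://localhost:9200` without authentication is an assumption; adjust both for a secured cluster.

```python
import json
import urllib.request


def geoip_downloader_request(base_url: str, enabled: bool) -> urllib.request.Request:
    """Build a PUT /_cluster/settings request that toggles the GeoIP downloader."""
    body = {"persistent": {"ingest.geoip.downloader.enabled": enabled}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed endpoint; nothing is sent until the request is passed to urlopen().
req = geoip_downloader_request("http://localhost:9200", enabled=False)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` sends it. The same setting can also go into `elasticsearch.yml` as `ingest.geoip.downloader.enabled: false` if the node is configured before it starts.
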

pabloochoa · September 12, 2022, 9:18am

For some reason, when I put that data into another Elasticsearch stack with a similar configuration and version, these errors disappeared and the cluster was able to reach green status without a problem.

In case it happens again, could you explain how to delete the index you mentioned?
Because of the errors, the ES cluster wasn't even able to reach yellow status, so I don't know what can be done with a cluster in red status.
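
On the red-cluster question, two standard APIs are usually the first things to try: the cluster allocation explain API, which reports why a shard is unassigned, and a reroute with `retry_failed=true`, which asks the master to retry allocations that hit the retry limit. The sketch below only builds the requests. The helper names are hypothetical and the base URL `http://localhost:9200` without authentication is an assumption. A retry only helps once the underlying cause (here, the missing translog file) has been dealt with.

```python
import json
import urllib.request


def allocation_explain_request(base_url: str, index: str, shard: int = 0,
                               primary: bool = True) -> urllib.request.Request:
    """Build a POST /_cluster/allocation/explain request for one shard."""
    body = {"index": index, "shard": shard, "primary": primary}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/allocation/explain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retry_failed_request(base_url: str) -> urllib.request.Request:
    """Build a POST /_cluster/reroute?retry_failed=true request."""
    return urllib.request.Request(
        url=f"{base_url}/_cluster/reroute?retry_failed=true",
        method="POST",
    )


# Assumed endpoint; the index name is the one from the first log entry above.
explain = allocation_explain_request("http://localhost:9200", ".geoip_databases")
retry = retry_failed_request("http://localhost:9200")
print(explain.get_method(), explain.full_url, explain.data.decode("utf-8"))
print(retry.get_method(), retry.full_url)
```

Each request can be sent with `urllib.request.urlopen`; the explain response names the shard, its unassigned reason, and the per-node allocation decisions.
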

warkolm · September 12, 2022, 9:49am

Sure - Delete index API | Elasticsearch Guide [8.4] | Elastic

pabloochoa · September 14, 2022, 6:37am

I have tried it, as the error appeared again, but I get the following:

curl -X DELETE "localhost:9200/.geoip_databases?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Indices [.geoip_databases] use and access is reserved for system operations"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Indices [.geoip_databases] use and access is reserved for system operations"
  },
  "status" : 400
}

system · October 12, 2022, 6:38am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
