Elasticsearch unable to start after shutting it down

Hello,
I have a cluster of 3 ES nodes running 7.16.3. All of them are both master and data nodes, as I am trying to deploy a minimal ES cluster that can survive the failure of one node. Maybe this is not a good configuration, so I would appreciate any suggestions.
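For context, each node runs in its own container and the heap is set through ES_JAVA_OPTS, along these lines (a hypothetical compose fragment for illustration only; the service name and sizes are not my exact deployment):

services:
  master1:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.3
    environment:
      # raised from the previous -Xms2g -Xmx2g; min and max heap kept equal
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"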
The problem is that, after stopping the respective containers in order to increase the minimum and maximum heap size (it was only 2 GB) and starting them again, I keep getting the following errors:

{"type": "server", "timestamp": "2022-09-02T09:51:22,549+02:00", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "es-cluster", "node.name": "master1", "message": "[.geoip_databases][0] marking and sending shard failed due to [failed recovery]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ" ,
 "stacktrace": ["org.elasticsearch.indices.recovery.RecoveryFailedException: [.geoip_databases][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}",
 "at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3234) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:391) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:439) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:86) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2349) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]",
 "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
 "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
 "at java.lang.Thread.run(Thread.java:833) [?:?]",
 "Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery",
 "... 11 more",
 "Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog] is corrupted",
 "at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1891) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "... 8 more",
 "Caused by: java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog/translog-64.tlog",
 "at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
 "at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
 "at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
 "at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]",
 "at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]",
 "at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]",
 "at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1886) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
 "... 8 more"] }
{"type": "server", "timestamp": "2022-09-02T09:51:27,153+02:00", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "es-cluster", "node.name": "master1", "message": "failing shard [failed shard, shard [.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0], node[dgg-QOvASYaBzaKkaaRRzQ], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=wkCEdz7SSYWhTeecmuArlw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-09-02T07:51:18.157Z], delayed=false, allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog] is corrupted]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-6.tlog]; ], markAsStale [true]]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ"

As shown in this last error message, the problem is that ES is looking for translog generations (translog-6, translog-9, ...) that do not exist on disk; the files actually present have different generation numbers. For the index ulldHaPTS9WiZPZS7ItVWA, for example, the translog that exists is /usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-2.tlog.
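The mismatch is easy to see by listing the shard's translog directory directly on the data path:

ls /usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/

In my case this showed only translog-2.tlog (plus the translog.ckp checkpoint file), while the recovery was asking for translog-6.tlog.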

This is the first time this has happened, so I don't know what I am supposed to do. Data loss is not critical, although I would prefer a solution that doesn't imply losing it.

I am currently using ES as index storage for LinkedIn DataHub. I have noticed that DataHub internally uses frozen indices (even though that feature is deprecated). Could that be related to these errors?

That's the index it's failing on. You can delete it and restart Elasticsearch, and it will redownload what it needs to recreate it. Otherwise you can disable the geoip downloader entirely (see the snippet below).

A deprecated feature won't cause this sort of error, however. It's not clear what did, but sharing more logs might help.
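If you go the disable route, it is a single setting, ingest.geoip.downloader.enabled (a sketch, assuming the default config location; it can also be set dynamically through the cluster settings API):

# elasticsearch.yml
ingest.geoip.downloader.enabled: false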

For some reason, when I moved that data into another Elasticsearch stack with a similar configuration and version, these errors disappeared and the cluster was able to reach green status without a problem.

In the event of a future recurrence, could you specify how to delete the index you mentioned?
When these errors occurred, the ES cluster couldn't even reach yellow status, so I don't know what could have been done with the cluster in a red state.
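For context, this is roughly how I was inspecting the failing shards while the cluster was red (standard cluster APIs, which still respond on a red cluster):

curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"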

Sure - Delete index API | Elasticsearch Guide [8.4] | Elastic

I have tried it now that the error has appeared again, but I get the following:

curl -X DELETE "localhost:9200/.geoip_databases?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Indices [.geoip_databases] use and access is reserved for system operations"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Indices [.geoip_databases] use and access is reserved for system operations"
  },
  "status" : 400
}
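Since .geoip_databases is a system index, it apparently cannot be deleted directly. A possible way around this, following the earlier suggestion to disable the downloader: toggle ingest.geoip.downloader.enabled through the cluster settings API (a sketch; my understanding of the 7.x docs is that disabling the downloader makes Elasticsearch remove the .geoip_databases index itself, and re-enabling it triggers a fresh download):

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "ingest.geoip.downloader.enabled": false
  }
}
'
# once the index has been removed, re-enable the downloader:
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "ingest.geoip.downloader.enabled": true
  }
}
'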
