Hello,
I have a cluster of 3 ES nodes running 7.16.3. All of them are both master-eligible and data nodes, as I am trying to deploy a minimal ES cluster that tolerates the failure of one node. Maybe this is not a good configuration, so I would appreciate any suggestions.
The problem is that after stopping the containers to increase the minimum and maximum heap size (it was only 2GB) and starting them again, I keep getting the following errors:
{"type": "server", "timestamp": "2022-09-02T09:51:22,549+02:00", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "es-cluster", "node.name": "master1", "message": "[.geoip_databases][0] marking and sending shard failed due to [failed recovery]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ" ,
"stacktrace": ["org.elasticsearch.indices.recovery.RecoveryFailedException: [.geoip_databases][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}",
"at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3234) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:391) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:439) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:86) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2349) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery",
"... 11 more",
"Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog] is corrupted",
"at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1891) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
"... 8 more",
"Caused by: java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/nodes/0/indices/aCGlO11ERaWENCnSMOSOmQ/0/translog/translog-64.tlog",
"at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]",
"at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]",
"at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]",
"at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]",
"at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]",
"at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1886) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1878) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:1992) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2015) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:470) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:88) ~[elasticsearch-7.16.3.jar:7.16.3]",
"at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.16.3.jar:7.16.3]",
"... 8 more"] }
{"type": "server", "timestamp": "2022-09-02T09:51:27,153+02:00", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "es-cluster", "node.name": "master1", "message": "failing shard [failed shard, shard [.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0], node[dgg-QOvASYaBzaKkaaRRzQ], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=wkCEdz7SSYWhTeecmuArlw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-09-02T07:51:18.157Z], delayed=false, allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[.ds-.logs-deprecation.elasticsearch-default-2022.08.19-000003][0]: Recovery failed on {master1}{dgg-QOvASYaBzaKkaaRRzQ}{7-Z2Se51Rf27SkbF9sVdcQ}{10.0.3.227}{10.0.3.227:9300}{dm}{xpack.installed=true, transform.node=false}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog] is corrupted]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-6.tlog]; ], markAsStale [true]]", "cluster.uuid": "25f5erfxQgWLxtbXwyuWXw", "node.id": "dgg-QOvASYaBzaKkaaRRzQ"
As this last error message shows, ES is looking for translog files with certain generation numbers (6, 9, ...) that do not exist on disk; the files that actually exist have different numbers. For example, for the index ulldHaPTS9WiZPZS7ItVWA the translog file that exists is /usr/share/elasticsearch/data/nodes/0/indices/ulldHaPTS9WiZPZS7ItVWA/0/translog/translog-2.tlog.
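To make the mismatch easier to see across all failing shards, something like the following could be run against the node's log and data directory. It is only a rough sketch: the function names are my own, it assumes the log text has the two `NoSuchFileException` shapes shown above, and it assumes the data path from the log is readable where the script runs (for example inside the container). It only reads files and does not repair anything.

```python
import re
from pathlib import Path, PurePosixPath

# Matches both shapes seen in the logs above:
#   "NoSuchFileException: /path/translog/translog-64.tlog"
#   "NoSuchFileException[/path/translog/translog-6.tlog]"
MISSING_RE = re.compile(
    r"NoSuchFileException(?:: |\[)(/[^\]\"\s]+/translog/translog-(\d+)\.tlog)"
)

def missing_translogs(log_text):
    """Map each translog directory to the generation numbers ES failed to open."""
    found = {}
    for match in MISSING_RE.finditer(log_text):
        directory = str(PurePosixPath(match.group(1)).parent)
        found.setdefault(directory, set()).add(int(match.group(2)))
    return found

def generations_on_disk(translog_dir):
    """Generation numbers of the translog-N.tlog files present in a directory."""
    gens = set()
    for f in Path(translog_dir).glob("translog-*.tlog"):
        m = re.fullmatch(r"translog-(\d+)\.tlog", f.name)
        if m:
            gens.add(int(m.group(1)))
    return gens

def report(log_text):
    """One summary line per affected shard: what ES wanted vs. what exists."""
    lines = []
    for directory, wanted in sorted(missing_translogs(log_text).items()):
        present = generations_on_disk(directory) if Path(directory).is_dir() else set()
        lines.append(f"{directory}: wanted {sorted(wanted)}, on disk {sorted(present)}")
    return lines
```

Calling `report(open("es.log").read())` (the log file name here is hypothetical) would print one line per shard directory, which should show the same kind of mismatch I found by hand for ulldHaPTS9WiZPZS7ItVWA.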
This is the first time this has happened, so I don't know what I am supposed to do. Losing the data would not be critical, although I would prefer a solution that does not involve losing it.