Hi, our Elasticsearch node keeps breaking down by itself after running for a few days. Each time we have had to delete the indices and then restart the node to get it working normally again, but the problem comes back after a few more days. We would really appreciate any help solving this!
The Elasticsearch version is 5.4.1 and the operating system is Ubuntu 14.04.
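For reference, the workaround we have been using is roughly the following (the index name 107room comes from the log below; the exact restart command is from memory and the install path under /alidata/server/elasticsearch is an assumption, so details may differ slightly on our box):

# delete the broken index (losing its data), assuming HTTP is on localhost:9200
curl -XDELETE 'http://localhost:9200/107room'
# then restart the node as a daemon from the install directory
/alidata/server/elasticsearch/bin/elasticsearch -d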
Here is the log from when Elasticsearch broke down:
[2019-12-07T01:04:08,770][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2019-12-07T01:04:38,793][WARN ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] high disk watermark [90%] exceeded on [xRIeFFvgTMes53cAJzhcYQ][107room-node-1][/alidata/server/elasticsearch/data/nodes/0] free: 1.3gb[6.6%], shards will be relocated away from this node
[2019-12-07T01:05:08,827][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-12-07T03:19:25,815][WARN ][o.e.i.e.Engine ] [107room-node-1] [107room][3] failed engine [already closed by tragic event on the translog]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
…
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,824][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [shard failure, reason [already closed by tragic event on the translog]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_161]
at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:127) ~[elasticsearch-5.4.1.jar:5.4.1]
…
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,826][WARN ][o.e.c.a.s.ShardStateAction] [107room-node-1] [107room][3] received shard failed for shard id [[107room][3]], allocation id [MqwuitbpTweTbVPquCRzDg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [NoSuchFileException[/alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
…
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,864][INFO ][o.e.c.r.a.AllocationService] [107room-node-1] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[107room][3]] ...]).
[2019-12-07T03:19:25,974][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [107room][3]: Recovery failed on {107room-node-1}{xRIeFFvgTMes53cAJzhcYQ}{mElodbIhS96k-5uqnbX8WQ}{127.0.0.1}{127.0.0.1:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1490) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.4.1.jar:5.4.1]
…
... 4 more
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.4.1.jar:5.4.1]
…
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
... 4 more
Caused by: java.nio.file.NoSuchFileException: /alidata/server/elasticsearch-5.4.1/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
…
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
... 4 more
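In case it is relevant, the log also shows the high disk watermark [90%] being exceeded shortly before the failure, with only 1.3gb (6.6%) free on the data path. Below is a rough sketch of the checks we can run the next time this happens (assuming the node serves HTTP on localhost:9200):

# free space on the data disk
df -h /alidata
# any explicitly configured disk-based shard allocation settings
curl 'http://localhost:9200/_cluster/settings?pretty'

Any pointers on whether the disk filling up could explain the missing translog.ckp file would also be very welcome.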