[HELP] All shards failed regularly

Hi, our ES cluster breaks down on its own after running for a few days. To get it working again we have to delete the indices and then restart it, but the problem comes back a few days later. We would really appreciate any help solving this!

The ES version is 5.4.1 and the operating system is Ubuntu 14.04.

Here is the log from when ES broke down:

[2019-12-07T01:04:08,770][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2019-12-07T01:04:38,793][WARN ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] high disk watermark [90%] exceeded on [xRIeFFvgTMes53cAJzhcYQ][107room-node-1][/alidata/server/elasticsearch/data/nodes/0] free: 1.3gb[6.6%], shards will be relocated away from this node
[2019-12-07T01:05:08,827][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-12-07T03:19:25,815][WARN ][o.e.i.e.Engine           ] [107room-node-1] [107room][3] failed engine [already closed by tragic event on the translog]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	…
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,824][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [shard failure, reason [already closed by tragic event on the translog]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
	at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_161]
	at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_161]
	at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:127) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,826][WARN ][o.e.c.a.s.ShardStateAction] [107room-node-1] [107room][3] received shard failed for shard id [[107room][3]], allocation id [MqwuitbpTweTbVPquCRzDg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [NoSuchFileException[/alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	…
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,864][INFO ][o.e.c.r.a.AllocationService] [107room-node-1] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[107room][3]] ...]).
[2019-12-07T03:19:25,974][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [107room][3]: Recovery failed on {107room-node-1}{xRIeFFvgTMes53cAJzhcYQ}{mElodbIhS96k-5uqnbX8WQ}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1490) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	... 4 more
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 4 more
Caused by: java.nio.file.NoSuchFileException: /alidata/server/elasticsearch-5.4.1/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	…
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 4 more
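
For what it's worth, the high-watermark warnings at the start of that log point to disk space first. Disk usage as Elasticsearch itself sees it can be checked with the _cat/allocation API (assuming the node answers on localhost:9200, as in the health check further down in this thread):

curl -XGET "http://localhost:9200/_cat/allocation?v"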

What type of storage are you using?

Sorry, could you clarify your question a bit? I am a newbie with ES. Thanks!

Are your Elasticsearch nodes using local disks or some other type of storage? What type of hardware is your cluster deployed on? How much data and how many shards do you have in the cluster?

Local disk. It runs on an Aliyun web server (similar to AWS) with Ubuntu. The partition where ES lives has 20G of total storage. It seems we use the default number of shards, since we didn't change it in the yml file.

The disk usage details are pasted below.

root@/alidata/server/elasticsearch# du -h
4.4M ./modules/lang-groovy
560K ./modules/lang-expression
1.2M ./modules/lang-painless
2.0M ./modules/reindex
180K ./modules/lang-mustache
1.4M ./modules/transport-netty3
112K ./modules/percolator
60K ./modules/aggs-matrix-stats
1.8M ./modules/ingest-common
2.5M ./modules/transport-netty4
14M ./modules
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/translog
279M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/_state
279M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/translog
277M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/_state
277M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/translog
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/_state
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/translog
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/_state
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/translog
276M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/_state
276M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/_state
1.4G ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA
1.4G ./data/nodes/0/indices
12K ./data/nodes/0/_state
1.4G ./data/nodes/0
1.4G ./data/nodes
1.4G ./data
22M ./lib
139M ./logs
348K ./bin
4.0K ./config/scripts
24K ./config
4.0K ./plugins
1.6G .
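
To confirm the shard count and index size from Elasticsearch's side rather than from du, something like the following should work against the same node:

curl -XGET "http://localhost:9200/_cat/indices?v"
curl -XGET "http://localhost:9200/_cat/shards?v"

With the 5.x defaults an index gets 5 primary shards and 1 replica, which matches the five shard directories (0 to 4) under ./data/nodes/0/indices in the listing above.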

By the way, our website also runs on the same partition of the server. The log files it produces push disk usage to over 90%. So far we have just deleted those log files manually whenever we receive the notification that disk usage has exceeded 90%. Not sure if this is the cause of the ES breakdown.
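
If deleting the website log files by hand is the workaround for now, a scheduled cleanup is less error-prone. A minimal sketch, assuming the logs live under a path like /alidata/log (the path and the 7-day retention are placeholders to adjust):

# remove website log files older than 7 days; path and retention are placeholders
find /alidata/log -name '*.log' -mtime +7 -delete

Something like logrotate would be the more standard way to handle this.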

The same problem has occurred again. Total disk usage was only around 60% when it happened. I had to "rm -r *" the indices and then restart ES.

Below is some additional info from when it is running normally.
root@/deploy/107room# curl -XGET "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
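
The yellow status and the 5 unassigned shards are expected on a single node with default settings: the unassigned shards are the replica copies, and a replica is never allocated on the same node as its primary. If this stays a one-node cluster, the replicas can be dropped so the health goes green, for example:

curl -XPUT "http://localhost:9200/107room/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

(107room is the index name taken from the log above; adjust as needed.)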

If this is hosted in the cloud, how can you be sure local disk is used?

In that sense, I'm not sure. Maybe I was wrong about that.

The disk is running full and hitting the watermarks, as you acknowledge, and that does not allow any more data to be written to your Elasticsearch cluster.
Since the cluster consists of a single node, there is no way to move data off that node and keep working (https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html). In general it is not advisable to run shared workloads, such as your website and the Elasticsearch service, on the same node. If this is an important service, the Elasticsearch cluster should have more nodes, and I would also advise adding more disk space so it does not run out as quickly as this server does now.
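
For reference, the watermarks mentioned in the log are the cluster.routing.allocation.disk.watermark.* settings; they can be checked and, if absolutely necessary, raised at runtime through the cluster settings API. A minimal sketch (the 95% value is just an example; on a 20G disk this is only a stopgap, and freeing or adding disk space is the real fix):

curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'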


Thanks a lot for your explanation! Will try to fix it as advised.

Hi, the problem has occurred again, this time with disk usage at only 67%. According to the logs, the cause is still the NoSuchFileException for the "translog.ckp" file. Why does this file disappear on its own? There is no high-watermark message in the logs this time.

We haven't added more nodes yet; we are still researching how to do that.
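
As for the missing translog.ckp itself: as far as I know, Elasticsearch 5.x ships a translog CLI that can truncate a broken translog for a single shard, which avoids deleting the whole index. It discards any operations that were only in that translog, and the node should be stopped before running it. A sketch, using the shard path from the log above:

bin/elasticsearch-translog truncate -d /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/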

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.