[HELP] All shards failed regularly

Hi, our ES cluster breaks down on its own after running for a few days. To get it working again we have to delete the indices and then restart it, but the problem comes back a few days later. We would really appreciate any help solving this!

The ES version is 5.4.1 and the operating system is Ubuntu 14.04.

Here is the log from when ES broke down:

[2019-12-07T01:04:08,770][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2019-12-07T01:04:38,793][WARN ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] high disk watermark [90%] exceeded on [xRIeFFvgTMes53cAJzhcYQ][107room-node-1][/alidata/server/elasticsearch/data/nodes/0] free: 1.3gb[6.6%], shards will be relocated away from this node
[2019-12-07T01:05:08,827][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-12-07T03:19:25,815][WARN ][o.e.i.e.Engine           ] [107room-node-1] [107room][3] failed engine [already closed by tragic event on the translog]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	…
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,824][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [shard failure, reason [already closed by tragic event on the translog]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
	at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_161]
	at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_161]
	at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:127) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,826][WARN ][o.e.c.a.s.ShardStateAction] [107room-node-1] [107room][3] received shard failed for shard id [[107room][3]], allocation id [MqwuitbpTweTbVPquCRzDg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [NoSuchFileException[/alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	…
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,864][INFO ][o.e.c.r.a.AllocationService] [107room-node-1] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[107room][3]] ...]).
[2019-12-07T03:19:25,974][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [107room][3]: Recovery failed on {107room-node-1}{xRIeFFvgTMes53cAJzhcYQ}{mElodbIhS96k-5uqnbX8WQ}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1490) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	... 4 more
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.4.1.jar:5.4.1]
	…
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 4 more
Caused by: java.nio.file.NoSuchFileException: /alidata/server/elasticsearch-5.4.1/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	…
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 4 more
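
For what it's worth, the high-watermark warnings at the start of that log point to disk space first. Disk usage as Elasticsearch itself sees it can be checked with the _cat/allocation API (assuming the node answers on localhost:9200, as in the health check further down in this thread):

curl -XGET "http://localhost:9200/_cat/allocation?v"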

What type of storage are you using?

Sorry, could you clarify your question a bit? I am a newbie with ES. Thanks!

Are your Elasticsearch nodes using local disks or some other type of storage? What type of hardware is your cluster deployed on? How much data and how many shards do you have in the cluster?

Local disk. It runs on an Aliyun web server (similar to AWS) with Ubuntu. The partition where ES lives has 20G of total storage. It seems we use the default number of shards, since we didn't change it in the yml file.

The disk usage details are pasted below.

root@/alidata/server/elasticsearch# du -h
4.4M ./modules/lang-groovy
560K ./modules/lang-expression
1.2M ./modules/lang-painless
2.0M ./modules/reindex
180K ./modules/lang-mustache
1.4M ./modules/transport-netty3
112K ./modules/percolator
60K ./modules/aggs-matrix-stats
1.8M ./modules/ingest-common
2.5M ./modules/transport-netty4
14M ./modules
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/translog
279M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1/_state
279M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/1
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/translog
277M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3/_state
277M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/3
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/translog
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2/_state
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/2
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/translog
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4/_state
275M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/4
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/translog
276M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/index
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0/_state
276M ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/0
8.0K ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA/_state
1.4G ./data/nodes/0/indices/fPDin4URS1a5xU1zSN56qA
1.4G ./data/nodes/0/indices
12K ./data/nodes/0/_state
1.4G ./data/nodes/0
1.4G ./data/nodes
1.4G ./data
22M ./lib
139M ./logs
348K ./bin
4.0K ./config/scripts
24K ./config
4.0K ./plugins
1.6G .
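
To confirm the shard count and index size from Elasticsearch's side rather than from du, something like the following should work against the same node:

curl -XGET "http://localhost:9200/_cat/indices?v"
curl -XGET "http://localhost:9200/_cat/shards?v"

With the 5.x defaults an index gets 5 primary shards and 1 replica, which matches the five shard directories (0 to 4) under ./data/nodes/0/indices in the listing above.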

By the way, our website also runs on the same partition of the server. The log files it produces push disk usage to over 90%. So far we have just deleted those log files manually whenever we receive the notification that disk usage has exceeded 90%. Not sure if this is the cause of the ES breakdown.
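
If deleting the website log files by hand is the workaround for now, a scheduled cleanup is less error-prone. A minimal sketch, assuming the logs live under a path like /alidata/log (the path and the 7-day retention are placeholders to adjust):

# remove website log files older than 7 days; path and retention are placeholders
find /alidata/log -name '*.log' -mtime +7 -delete

Something like logrotate would be the more standard way to handle this.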

The same problem has occurred again. Total disk usage was only around 60% when it happened. I had to "rm -r *" the indices and then restart ES.

Below is some additional info from when it is running normally.
root@/deploy/107room# curl -XGET "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
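
The yellow status and the 5 unassigned shards are expected on a single node with default settings: the unassigned shards are the replica copies, and a replica is never allocated on the same node as its primary. If this stays a one-node cluster, the replicas can be dropped so the health goes green, for example:

curl -XPUT "http://localhost:9200/107room/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

(107room is the index name taken from the log above; adjust as needed.)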

If this is hosted in the cloud, how can you be sure local disk is used?

In that sense, I'm not sure. Maybe I was wrong about that.

The disk is running full and hitting the watermarks, as you acknowledge, and that does not allow any more data to be written to your Elasticsearch cluster.
Since the cluster consists of a single node, there is no way to move data off that node and keep working (https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html). In general it is not advisable to run shared workloads, such as your website and the Elasticsearch service, on the same node. If this is an important service, the Elasticsearch cluster should have more nodes, and I would also advise adding more disk space so it does not run out as quickly as this server does now.
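
For reference, the watermarks mentioned in the log are the cluster.routing.allocation.disk.watermark.* settings; they can be checked and, if absolutely necessary, raised at runtime through the cluster settings API. A minimal sketch (the 95% value is just an example; on a 20G disk this is only a stopgap, and freeing or adding disk space is the real fix):

curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'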


Thanks a lot for your explanation! Will try to fix it as advised.

Hi, the problem has occurred again, this time with disk usage at only 67%. According to the logs, the cause is still the NoSuchFileException for the "translog.ckp" file. Why does this file disappear on its own? There is no high-watermark message in the logs this time.

We haven't added more nodes yet; we are still researching how to do that.
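
As for the missing translog.ckp itself: as far as I know, Elasticsearch 5.x ships a translog CLI that can truncate a broken translog for a single shard, which avoids deleting the whole index. It discards any operations that were only in that translog, and the node should be stopped before running it. A sketch, using the shard path from the log above:

bin/elasticsearch-translog truncate -d /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/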

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.