ES 2.3.5, shard stuck in Translog stage

Our cluster became completely unresponsive to index requests (they would time out), so I started a rolling restart of the nodes. After I took the third node (out of 7) down, index requests began succeeding again. However, the Elasticsearch service on that node never finished shutting down on its own, so after an hour I force-killed it. When I brought the service back up, the node rejoined the cluster and all of its shards initialized except for one, which is stuck in the translog stage. It has been there for hours.
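
For reference, this is roughly how I'm confirming that only the one shard is still initializing (same httpie client I use further down; the grep is just shorthand for "anything not yet started"):

http es1-client01.c.fp:9200/_cat/shards | grep -v STARTED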

The CPU on that node is at 100%.
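
To see what the node is actually burning that CPU on, the hot threads API should show the busy threads (a sketch; I haven't captured its output here):

http es1-client01.c.fp:9200/_nodes/es1-data03/hot_threads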

There's no activity in the logs either, just that the node joined the cluster, found the master, and "started":

[2017-04-14 22:40:50,168][INFO ][node                     ] [es1-data03] version[2.3.5], pid[10875], build[90f439f/2016-07-27T10:36:52Z]
[2017-04-14 22:40:50,168][INFO ][node                     ] [es1-data03] initializing ...
[2017-04-14 22:40:50,954][INFO ][plugins                  ] [es1-data03] modules [lang-groovy, reindex, lang-expression], plugins [hq, graph, analysis-smartcn, marvel-agent, watcher, license, analysis-icu], sites [hq]
[2017-04-14 22:40:50,991][INFO ][env                      ] [es1-data03] using [1] data paths, mounts [[/mnt/storage (/dev/sdb)]], net usable_space [739.9gb], net total_space [999.8gb], spins? [no], types [xfs]
[2017-04-14 22:40:50,991][INFO ][env                      ] [es1-data03] heap size [7.2gb], compressed ordinary object pointers [true]
[2017-04-14 22:40:50,992][WARN ][env                      ] [es1-data03] max file descriptors [65535] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-04-14 22:40:51,055][INFO ][watcher.trigger.schedule ] [es1-data03] using [ticker] schedule trigger engine
[2017-04-14 22:40:53,209][WARN ][gateway                  ] [es1-data03] time setting [index.refresh_interval] with value [-1] is missing units; assuming default units (ms) but in future versions this will be a hard error
[2017-04-14 22:40:53,326][WARN ][gateway                  ] [es1-data03] time setting [index.refresh_interval] with value [-1] is missing units; assuming default units (ms) but in future versions this will be a hard error
[2017-04-14 22:40:53,367][INFO ][node                     ] [es1-data03] initialized
[2017-04-14 22:40:53,367][INFO ][node                     ] [es1-data03] starting ...
[2017-04-14 22:40:53,441][INFO ][transport                ] [es1-data03] publish_address {10.240.0.126:9300}, bound_addresses {10.240.0.126:9300}
[2017-04-14 22:40:53,446][INFO ][discovery                ] [es1-data03] es-cluster1/jRuHkDiST5yuvN9vnkxyVg
[2017-04-14 22:40:56,555][INFO ][cluster.service          ] [es1-data03] detected_master {es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true}, added {{es1-data05}{IL1Pex6NRIy-mwKlkECn3g}{10.240.0.128}{10.240.0.128:9300}{master=false},{es1-data04}{Ujcdl23RQUWHHFixxGSzOQ}{10.240.0.127}{10.240.0.127:9300}{master=false},{es1-master02}{k-XJFbDLRK2H4VA_fp_MIQ}{10.240.0.121}{10.240.0.121:9300}{data=false, master=true},{es1-client01}{pJ5gWdFGQW6adP-ctkNsVQ}{10.240.0.118}{10.240.0.118:9300}{data=false, master=false},{es1-client02}{lMEJJcl3QAyTTplIax_uzw}{10.240.0.119}{10.240.0.119:9300}{data=false, master=false},{es1-master03}{05qPQUjkRQqMSdohypBnfA}{10.240.0.122}{10.240.0.122:9300}{data=false, master=true},{es1-data02}{2GQ-er5ZQDWpP6UOAQWO9Q}{10.240.0.125}{10.240.0.125:9300}{master=false},{es1-data07}{s-o1RJslSNCp6Hzknwx1rg}{10.240.0.130}{10.240.0.130:9300}{master=false},{es1-data01}{NhjjBgoTTa2NzD4i2r6Ktg}{10.240.0.124}{10.240.0.124:9300}{master=false},{es1-data06}{xeIcbqgxSr-iakoTuyG5_g}{10.240.0.129}{10.240.0.129:9300}{master=false},{es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true}])
[2017-04-14 22:40:56,595][INFO ][cluster.routing.allocation.decider] [es1-data03] updating [cluster.routing.allocation.disk.watermark.low] to [90%]
[2017-04-14 22:40:56,596][INFO ][indices.store            ] [es1-data03] updating indices.store.throttle.max_bytes_per_sec from [10gb] to [200mb], note, type is [NONE]
[2017-04-14 22:40:56,789][INFO ][license.plugin.core      ] [es1-data03] license [***] - valid
[2017-04-14 22:40:57,012][INFO ][http                     ] [es1-data03] publish_address {10.240.0.126:9200}, bound_addresses {10.240.0.126:9200}
[2017-04-14 22:40:57,013][INFO ][node                     ] [es1-data03] started
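
Since the log stays quiet at INFO, one thing I could try is raising the recovery logger dynamically while the shard recovers (a sketch using the cluster settings API; I haven't actually done this, and the second call would put the level back afterwards):

http PUT es1-client01.c.fp:9200/_cluster/settings transient:='{"logger.indices.recovery": "DEBUG"}'
http PUT es1-client01.c.fp:9200/_cluster/settings transient:='{"logger.indices.recovery": "INFO"}'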

Here's what _cat/recovery reports for that shard (index name redacted):

http es1-client01.c.fp:9200/_cat/recovery | grep translog
<index-name>                 49 2290691 store      translog 10.240.0.126 10.240.0.126 n/a n/a 0   100.0% 0           100.0% 212 14167448866 263   50.1%  525

If I'm reading the unlabeled columns right, the last three values are translog operations recovered, percent, and total, so the replay looks to be about half done (263 of 525).
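
In case it helps anyone else, the per-index recovery API reports the same information as named JSON fields, which should make the translog progress easier to read (<index-name> stands for the real index, which I've redacted):

http 'es1-client01.c.fp:9200/<index-name>/_recovery?human'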

OK, I just needed to wait longer. It just finished!
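
For anyone who hits this later, a quick sanity check that everything has settled is to confirm the cluster is green again:

http 'es1-client01.c.fp:9200/_cat/health?v'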
