ES 2.3.5, shard stuck in Translog stage

Our cluster became completely unresponsive to index requests (they would time out), so I started a rolling restart of the nodes. After I took the third node (out of 7) down, index requests began succeeding again. However, the Elasticsearch service on that node never finished shutting down on its own, so after an hour I force-killed it. When I brought the service back up, the node rejoined the cluster and all of its shards initialized except for one, which is stuck in the translog stage. It has been there for hours.
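
For reference, this is roughly how I'm confirming that only the one shard is still initializing (same httpie client I use further down; the grep is just shorthand for "anything not yet started"):

http es1-client01.c.fp:9200/_cat/shards | grep -v STARTED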

The CPU on that node is at 100%.
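
To see what the node is actually burning that CPU on, the hot threads API should show the busy threads (a sketch; I haven't captured its output here):

http es1-client01.c.fp:9200/_nodes/es1-data03/hot_threads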

There's no activity in the logs either, just that the node joined the cluster, found the master, and "started":

[2017-04-14 22:40:50,168][INFO ][node                     ] [es1-data03] version[2.3.5], pid[10875], build[90f439f/2016-07-27T10:36:52Z]
[2017-04-14 22:40:50,168][INFO ][node                     ] [es1-data03] initializing ...
[2017-04-14 22:40:50,954][INFO ][plugins                  ] [es1-data03] modules [lang-groovy, reindex, lang-expression], plugins [hq, graph, analysis-smartcn, marvel-agent, watcher, license, analysis-icu], sites [hq]
[2017-04-14 22:40:50,991][INFO ][env                      ] [es1-data03] using [1] data paths, mounts [[/mnt/storage (/dev/sdb)]], net usable_space [739.9gb], net total_space [999.8gb], spins? [no], types [xfs]
[2017-04-14 22:40:50,991][INFO ][env                      ] [es1-data03] heap size [7.2gb], compressed ordinary object pointers [true]
[2017-04-14 22:40:50,992][WARN ][env                      ] [es1-data03] max file descriptors [65535] for elasticsearch process likely too low, consider increasing to at least [65536]
[2017-04-14 22:40:51,055][INFO ][watcher.trigger.schedule ] [es1-data03] using [ticker] schedule trigger engine
[2017-04-14 22:40:53,209][WARN ][gateway                  ] [es1-data03] time setting [index.refresh_interval] with value [-1] is missing units; assuming default units (ms) but in future versions this will be a hard error
[2017-04-14 22:40:53,326][WARN ][gateway                  ] [es1-data03] time setting [index.refresh_interval] with value [-1] is missing units; assuming default units (ms) but in future versions this will be a hard error
[2017-04-14 22:40:53,367][INFO ][node                     ] [es1-data03] initialized
[2017-04-14 22:40:53,367][INFO ][node                     ] [es1-data03] starting ...
[2017-04-14 22:40:53,441][INFO ][transport                ] [es1-data03] publish_address {10.240.0.126:9300}, bound_addresses {10.240.0.126:9300}
[2017-04-14 22:40:53,446][INFO ][discovery                ] [es1-data03] es-cluster1/jRuHkDiST5yuvN9vnkxyVg
[2017-04-14 22:40:56,555][INFO ][cluster.service          ] [es1-data03] detected_master {es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true}, added {{es1-data05}{IL1Pex6NRIy-mwKlkECn3g}{10.240.0.128}{10.240.0.128:9300}{master=false},{es1-data04}{Ujcdl23RQUWHHFixxGSzOQ}{10.240.0.127}{10.240.0.127:9300}{master=false},{es1-master02}{k-XJFbDLRK2H4VA_fp_MIQ}{10.240.0.121}{10.240.0.121:9300}{data=false, master=true},{es1-client01}{pJ5gWdFGQW6adP-ctkNsVQ}{10.240.0.118}{10.240.0.118:9300}{data=false, master=false},{es1-client02}{lMEJJcl3QAyTTplIax_uzw}{10.240.0.119}{10.240.0.119:9300}{data=false, master=false},{es1-master03}{05qPQUjkRQqMSdohypBnfA}{10.240.0.122}{10.240.0.122:9300}{data=false, master=true},{es1-data02}{2GQ-er5ZQDWpP6UOAQWO9Q}{10.240.0.125}{10.240.0.125:9300}{master=false},{es1-data07}{s-o1RJslSNCp6Hzknwx1rg}{10.240.0.130}{10.240.0.130:9300}{master=false},{es1-data01}{NhjjBgoTTa2NzD4i2r6Ktg}{10.240.0.124}{10.240.0.124:9300}{master=false},{es1-data06}{xeIcbqgxSr-iakoTuyG5_g}{10.240.0.129}{10.240.0.129:9300}{master=false},{es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{es1-master01}{9VdCoKNVRdCNybO1d-NjKg}{10.240.0.120}{10.240.0.120:9300}{data=false, master=true}])
[2017-04-14 22:40:56,595][INFO ][cluster.routing.allocation.decider] [es1-data03] updating [cluster.routing.allocation.disk.watermark.low] to [90%]
[2017-04-14 22:40:56,596][INFO ][indices.store            ] [es1-data03] updating indices.store.throttle.max_bytes_per_sec from [10gb] to [200mb], note, type is [NONE]
[2017-04-14 22:40:56,789][INFO ][license.plugin.core      ] [es1-data03] license [***] - valid
[2017-04-14 22:40:57,012][INFO ][http                     ] [es1-data03] publish_address {10.240.0.126:9200}, bound_addresses {10.240.0.126:9200}
[2017-04-14 22:40:57,013][INFO ][node                     ] [es1-data03] started
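
Since the log stays quiet at INFO, one thing I could try is raising the recovery logger dynamically while the shard recovers (a sketch using the cluster settings API; I haven't actually done this, and the second call would put the level back afterwards):

http PUT es1-client01.c.fp:9200/_cluster/settings transient:='{"logger.indices.recovery": "DEBUG"}'
http PUT es1-client01.c.fp:9200/_cluster/settings transient:='{"logger.indices.recovery": "INFO"}'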

Here's what _cat/recovery reports for that shard (index name redacted):

http es1-client01.c.fp:9200/_cat/recovery | grep translog
<index-name>                 49 2290691 store      translog 10.240.0.126 10.240.0.126 n/a n/a 0   100.0% 0           100.0% 212 14167448866 263   50.1%  525

If I'm reading the unlabeled columns right, the last three values are translog operations recovered, percent, and total, so the replay looks to be about half done (263 of 525).
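
In case it helps anyone else, the per-index recovery API reports the same information as named JSON fields, which should make the translog progress easier to read (<index-name> stands for the real index, which I've redacted):

http 'es1-client01.c.fp:9200/<index-name>/_recovery?human'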

OK, I just needed to wait longer. It just finished!
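
For anyone who hits this later, a quick sanity check that everything has settled is to confirm the cluster is green again:

http 'es1-client01.c.fp:9200/_cat/health?v'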
