FWIW, I've been having similar issues with my cluster in the past few
days. Some shards get stuck recovering and never make progress.
I have tried several different HEADs, and the first one that worked
was b785979. I deleted all the work directories and restarted the
cluster, and it recovered all primary shards in about 30 minutes, and
all replicas about 20 minutes later. This is about 90 gigs worth of
index recovered in less than an hour. Fortunately our gateway is not
corrupted.
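For reference, recovery progress like this can be watched from the cluster health API. A minimal sketch (the host/port are placeholders for one of your nodes; the crude `sed` JSON extraction is just for illustration):

```shell
# Poll cluster health once a second until the cluster leaves "red",
# i.e. all primary shards have been recovered.
# HOST is a placeholder; point it at one of your own nodes.
HOST=http://localhost:9200
while true; do
  date
  # crude extraction of the "status" field from the JSON response
  status=$(curl -s "$HOST/_cluster/health" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
  echo "cluster status: $status"
  # stop once status is no longer "red" (also stops if the node is unreachable)
  if [ "$status" != "red" ]; then
    break
  fi
  sleep 1
done
```

Once it goes yellow, primaries are up; green means replicas are done too.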
On Aug 20, 10:30 am, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
Probably need some better error detection...
Thanks for the help.
...Ken
Shay Banon wrote:
It means that the gateway store got corrupted. You will have to rebuild
the index. Probably due to all the HEAD changes... Hopefully it's getting
stable now.

-shay.banon
On Fri, Aug 20, 2010 at 8:13 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:

Looks like a file may be missing on the gateway... this repeats in the log over and over:

[12:10:00,597][WARN ][indices.cluster ] [Magilla] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] Failed to recover translog
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:516)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:417)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:172)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [twitter][1] Failed to open reader on writer
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:171)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:405)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:440)
    ... 5 more
Caused by: java.io.FileNotFoundException: /mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:76)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:87)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
    at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:287)
    at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
    at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:114)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:590)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:616)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:574)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:150)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:410)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:374)
    at org.elasticsearch.index.engine.robin.RobinEngine.buildNrtResource(RobinEngine.java:538)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:158)
    ... 7 more
[12:10:00,605][WARN ][cluster.action.shard ] [Magilla] sending failed shard for [twitter][1], node[10dab323-019b-4036-854f-89bb068dcc8d], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] Failed to recover translog]; nested: EngineCreationFailureException[[twitter][1] Failed to open reader on writer]; nested: FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)]; ]]

Shay Banon wrote:
> Also, use the latest again, pushed some more fixes.
>
> On Fri, Aug 20, 2010 at 8:04 PM, Shay Banon <shay.ba...@elasticsearch.com> wrote:
> Do you see any exceptions in the logs (failing to start the shard)?
>
> On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> Now it's looping: progress is going to 100, then starting over.
> I set up a 1/second loop using:
>   while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
> then copied it to gist at: http://gist.github.com/540711
> It should have recovered by now, I would think.
> ...Ken
>
> Shay Banon wrote:
> > great, ping me if it does not end, I am here to help (we can make it
> > more interactive on IRC).
> > p.s. Can you keep the original json format when you gist? Much easier to
> > know what's going on.
> > You can add pretty=true as a parameter to get it pretty printed.
> > -shay.banon
> >
> > On Fri, Aug 20, 2010 at 5:51 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> > I think so... Here's the latest on gist: http://gist.github.com/540471
> > Thanks for the pointer on gist, I've never used it before.
> >
> > Shay Banon wrote:
> > > The top just states which shards were queried; a shard that is
> > > still not allocated will obviously not be allocated. It seems like
> > > it's still in the recovery process. There are two main APIs to really
> > > understand what is going on (apart from the high level health API): the
> > > cluster state API, which shows you what the cluster-wide state is (where
> > > each shard is supposed to be, what its state is), and the status API,
> > > which gives you detailed information on the status of each shard
> > > allocated on each node.
> > > Is the recovery progressing?
> > > p.s. Can you use gist instead of pastebin?
> > > -shay.banon
> > >
> > > On Fri, Aug 20, 2010 at 5:13 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> > > I restarted and now 35 of 36 are successful, but if you look at the
> > > status, it's showing multiple shards in recovery. I'm confused.
> > > See cluster status in http://pastebin.com/9qWLf3mk
> > >
> > > Kenneth Loafman wrote:
> > > > Will do so in just a bit...
> > > > Shay Banon wrote:
> > > >> ... can you test?
> > > >> On Fri, Aug 20, 2010 at 4:02 PM, Shay Banon
> > > >> <shay.ba...@elasticsearch.com> ...
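The two inspection APIs discussed in the thread can be sketched as plain curl calls (the host and the "twitter" index name are just examples from this thread; adjust for your own cluster):

```shell
# Sketch of the two inspection APIs from the thread; HOST is a placeholder.
HOST=http://localhost:9200

# Cluster state: cluster-wide view -- where each shard is supposed to be
# and what state it is in.  "|| true" so the sketch doesn't abort if the
# node is down.
curl -s "$HOST/_cluster/state?pretty=true" || true

# Status: detailed per-shard information (e.g. recovery progress) for each
# shard allocated on each node.
curl -s "$HOST/twitter/_status?pretty=true" || true
```

Adding pretty=true, as Shay notes above, makes the JSON readable enough to paste into a gist.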