FWIW, I've been having similar issues with my cluster in the past few
days. Some shards get stuck recovering and never make progress.
I have tried several different HEADs, and the first one that worked
was b785979. I deleted all the work directories and restarted the
cluster, and it recovered all primary shards in about 30 minutes, and
all replicas about 20 minutes later. This is about 90 gigs worth of
index recovered in less than an hour. Fortunately our gateway is not
corrupted.
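For reference, recovery progress like this can be watched from the cluster health API. A minimal sketch (the host/port are placeholders for one of your nodes; the crude `sed` JSON extraction is just for illustration):

```shell
# Poll cluster health once a second until the cluster leaves "red",
# i.e. all primary shards have been recovered.
# HOST is a placeholder; point it at one of your own nodes.
HOST=http://localhost:9200
while true; do
  date
  # crude extraction of the "status" field from the JSON response
  status=$(curl -s "$HOST/_cluster/health" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
  echo "cluster status: $status"
  # stop once status is no longer "red" (also stops if the node is unreachable)
  if [ "$status" != "red" ]; then
    break
  fi
  sleep 1
done
```

Once it goes yellow, primaries are up; green means replicas are done too.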
On Aug 20, 10:30 am, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
Probably need some better error detection...
Thanks for the help.
...Ken
Shay Banon wrote:
It means that the gateway store got corrupted. You will have to rebuild
the index. Probably due to all the HEAD changes... Hopefully it's getting
stable now.

-shay.banon
On Fri, Aug 20, 2010 at 8:13 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:

Looks like a file may be missing on the gateway... this repeats in the log over and over:

[12:10:00,597][WARN ][indices.cluster ] [Magilla] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] Failed to recover translog
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:516)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:417)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:172)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [twitter][1] Failed to open reader on writer
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:171)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:405)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:440)
    ... 5 more
Caused by: java.io.FileNotFoundException: /mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:76)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:87)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
    at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:287)
    at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
    at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:114)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:590)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:616)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:574)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:150)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:410)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:374)
    at org.elasticsearch.index.engine.robin.RobinEngine.buildNrtResource(RobinEngine.java:538)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:158)
    ... 7 more
[12:10:00,605][WARN ][cluster.action.shard ] [Magilla] sending failed shard for [twitter][1], node[10dab323-019b-4036-854f-89bb068dcc8d], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] Failed to recover translog]; nested: EngineCreationFailureException[[twitter][1] Failed to open reader on writer]; nested: FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)]; ]]

Shay Banon wrote:
> Also, use the latest again, pushed some more fixes.
>
> On Fri, Aug 20, 2010 at 8:04 PM, Shay Banon <shay.ba...@elasticsearch.com> wrote:
> Do you see any exceptions in the logs (failing to start the shard)?
>
> On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> Now it's looping: progress is going to 100, then starting over.
> I set up a 1/second loop using:
>   while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
> then copied it to gist at: http://gist.github.com/540711
> It should have recovered by now, I would think.
> ...Ken
>
> Shay Banon wrote:
> > great, ping me if it does not end, I am here to help (we can make it
> > more interactive on IRC).
> > p.s. Can you keep the original json format when you gist? Much easier to
> > know what's going on.
> > You can add pretty=true as a parameter to get it pretty printed.
> > -shay.banon
> >
> > On Fri, Aug 20, 2010 at 5:51 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> > I think so... Here's the latest on gist: http://gist.github.com/540471
> > Thanks for the pointer on gist, I've never used it before.
> >
> > Shay Banon wrote:
> > > The top just states which shards were queried; a shard that is
> > > still not allocated will obviously not be allocated. It seems like
> > > it's still in the recovery process. There are two main APIs to really
> > > understand what is going on (apart from the high level health API): the
> > > cluster state API, which shows you what the cluster-wide state is (where
> > > each shard is supposed to be, what its state is), and the status API,
> > > which gives you detailed information on the status of each shard
> > > allocated on each node.
> > > Is the recovery progressing?
> > > p.s. Can you use gist instead of pastebin?
> > > -shay.banon
> > >
> > > On Fri, Aug 20, 2010 at 5:13 PM, Kenneth Loafman <kenneth.loaf...@gmail.com> wrote:
> > > I restarted and now 35 of 36 are successful, but if you look at the
> > > status, it's showing multiple shards in recovery. I'm confused.
> > > See cluster status in http://pastebin.com/9qWLf3mk
> > >
> > > Kenneth Loafman wrote:
> > > > Will do so in just a bit...
> > > > Shay Banon wrote:
> > > >> ... can you test?
> > > >> On Fri, Aug 20, 2010 at 4:02 PM, Shay Banon
> > > >> <shay.ba...@elasticsearch.com> ...
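The two inspection APIs discussed in the thread can be sketched as plain curl calls (the host and the "twitter" index name are just examples from this thread; adjust for your own cluster):

```shell
# Sketch of the two inspection APIs from the thread; HOST is a placeholder.
HOST=http://localhost:9200

# Cluster state: cluster-wide view -- where each shard is supposed to be
# and what state it is in.  "|| true" so the sketch doesn't abort if the
# node is down.
curl -s "$HOST/_cluster/state?pretty=true" || true

# Status: detailed per-shard information (e.g. recovery progress) for each
# shard allocated on each node.
curl -s "$HOST/twitter/_status?pretty=true" || true
```

Adding pretty=true, as Shay notes above, makes the JSON readable enough to paste into a gist.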