I am having serious problems with recovering my index after a power
failure. The setup I have is as follows:
- 1 master-only node
- 4 data-only nodes
- Around 20 million documents indexed, spread over 8 shards with a replication factor of 2.
After power-failure, I make sure to start the master node first.
Thereafter, I start all the data nodes as quickly as possible. The HEAD
plugin shows the cluster status as RED, and it literally takes hours to
recover. The document count will have to be much higher than 20 million
before I can take this into production.
I have ensured that the open file limit (ulimit -n) is 32000 on all
machines, and I can have ElasticSearch verify this for me. When viewing
the logs, I get the following message repeatedly thrown on all of the data
nodes:
[2012-04-04 17:27:22,003][WARN ][indices.cluster ] [ClusterPC2] [myindex][6] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index][6] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:228)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [myindex][6] Failed to open reader on writer
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:279)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:579)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:175)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    ... 3 more
Caused by: java.io.FileNotFoundException: /data1/elasticsearch/elasticsearchcluster/nodes/0/indices/myindex/6/index/_nyo.tii (Too many open files)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:70)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:92)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:79)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:458)
    at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:113)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:76)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:115)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:705)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:663)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:157)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:38)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:453)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:401)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:345)
    at org.elasticsearch.index.engine.robin.RobinEngine.buildNrtResource(RobinEngine.java:1365)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:263)
    ... 6 more
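Since the root cause is "Too many open files", here is roughly how I check the limit (a sketch, assuming Linux; the /proc lookup and the ES_PID placeholder are mine, not from any ElasticSearch tooling):

```shell
#!/bin/sh
# Soft and hard open-file limits for the current shell/user.
# Note these apply to *this* shell, not necessarily to a daemon.
ulimit -Sn
ulimit -Hn

# The limit that actually matters is the one seen by the running
# ElasticSearch JVM, which can differ from the login shell's limit when
# the node is started by the service wrapper. On Linux, /proc exposes
# the real per-process limit ($ES_PID is a placeholder for the node's PID):
#   grep 'Max open files' /proc/$ES_PID/limits
```

One caveat I'm aware of: limits set in /etc/security/limits.conf apply to PAM login sessions, so a node started from an init script/service wrapper may not pick them up at all, which would explain a 32000 shell limit coexisting with a much lower limit in the actual process.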
I literally get thousands of these entries in the logs, several gigabytes'
worth if I don't stop the process. On the master PC I get this message:
[2012-04-04 17:29:31,613][WARN ][cluster.action.shard ] [ClusterPC2] received shard failed for [myindex][5], node[ufvdnrd-QMm0eSEBNowUag], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[myindex][5] failed recovery]; nested: EngineCreationFailureException[[myindex][5] Failed to open reader on writer]; nested: FileNotFoundException[/data2/elasticsearch/elasticsearchcluster/nodes/0/indices/myindex/5/index/_91x.prx (Too many open files)]; ]]
To solve the problem, I have to restart all the data node services (I use
the service wrapper), so I run "service elasticsearch restart" on all my
data node PCs. Very soon after this, the HEAD plugin shows the status as
yellow, and shards are being initialised and copied. This process still
takes several hours, though, and I suspect that's not normal. Is it
possible that the index structure somehow got corrupted during that
initial frenzy of "file could not be found" errors? Each shard is about
16GB in size at the moment, and it will surely still grow quite a bit.
Any advice? Hopefully I'm doing something wrong here...
Thanks, Thinus