Hi,
I am testing the behavior of elasticsearch at large scale.
My setup has two 64-bit nodes with 8 CPUs each, running ES 0.7.1 (as a service) and using an index gateway in fs mode (on NFS).
I have a single index with 5 shards and 1 replica per shard (5/1), so 10 shards in total.
I have indexed 8.5 million documents.
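For reference, the 10-shard figure follows directly from the 5/1 setting (assuming the usual accounting of primaries times one-plus-replicas):

```python
# Total shard copies for a 5-primary, 1-replica index (the "5/1" above).
primaries = 5
replicas_per_primary = 1
total_shards = primaries * (1 + replicas_per_primary)
print(total_shards)  # → 10
```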
Here is the disk usage of all shards and translogs in the gateway:
6.0G ./0/index
4.4M ./0/translog
6.2G ./1/index
2.3M ./1/translog
6.4G ./2/index
2.8M ./2/translog
6.9G ./3/index
1.4M ./3/translog
6.5G ./4/index
2.6M ./4/translog
32G .
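As a sanity check, the per-shard index sizes above account for essentially the whole 32G total (the translogs only add a few MB):

```python
# Per-shard index sizes from the du listing above, in GB.
index_gb = [6.0, 6.2, 6.4, 6.9, 6.5]
total = sum(index_gb)
print(round(total, 1))  # → 32.0, matching the reported total
```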
I then did the following:
- stopped the indexing process
- stopped one of the ES nodes
- waited about 3 minutes
- stopped the other node
- restarted the first node
- queried for the number of docs: curl -XGET 'http://localhost:9200/en/_count?q=*:*'
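The same check can be scripted; a minimal sketch, assuming a node listening on localhost:9200 and the _count endpoint used by the curl command above (error handling omitted):

```python
# Sketch of the post-restart doc-count check against a local ES node.
import json
from urllib.request import urlopen

BASE = "http://localhost:9200"

def count_url(index, query="*:*"):
    # _count with a match-all query string, as in the curl command.
    return f"{BASE}/{index}/_count?q={query}"

def doc_count(index):
    # Parses the JSON response; returns the "count" field.
    with urlopen(count_url(index)) as resp:
        return json.load(resp)["count"]

# doc_count("en") would return the live document count.
```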
I noticed the load on the machine was high (11-15, even now, 30 minutes after the restart).
At first I got zero results.
After maybe 10 minutes I saw exceptions in the log (see below), and I got only 6.5 million docs: one shard is corrupted.
I got a CorruptIndexException in the log file.
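The drop in the count is consistent with losing a single shard, assuming documents are spread roughly evenly across the 5 primaries:

```python
# Rough check: one lost shard out of five should leave about 4/5
# of the 8.5M documents (assumes an even spread across primaries).
total_docs = 8_500_000
shards = 5
expected_after_loss = total_docs * (shards - 1) / shards
print(expected_after_loss)  # → 6800000.0, in line with the ~6.5M returned
```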
Also attached:
http://elasticsearch-users.115913.n3.nabble.com/file/n861253/info.zip
info.zip contains the cluster health, cluster state, nodes info and the full log file.
[07:23:33,508][WARN ][indices.cluster ] [Leeds, Betty Brant] Failed to start shard for index [en] and shard id [3]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [en][3] Failed to perform recovery of translog
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.recoverTranslog(FsIndexShardGateway.java:381)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.recover(FsIndexShardGateway.java:111)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.recover(IndexShardGatewayService.java:133)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService$3.run(IndicesClusterStateService.java:342)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [en][3] Failed to open reader on writer
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:166)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecovery(InternalIndexShard.java:407)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.recoverTranslog(FsIndexShardGateway.java:378)
    ... 6 more
Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _1mfa: fieldsReader shows 334 but segmentInfo shows 3215
    at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:282)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:578)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:609)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:568)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:150)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:405)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:372)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:150)
    ... 8 more