It all started when we terminated the over-running snapshot restore process. After this, when we tried to start Elasticsearch, we got the error below. Could someone please help? It is Elasticsearch 1.3.4 on Linux.
We are not in a position to do a fresh install, so any small hacks to get around the problem would be helpful.
Details from the log (some data masked):
[2016-03-29 08:24:12,543][WARN ][cluster.action.shard ]
[instance_9300] [default-index][3] received shard failed for [default-index][3],
node[XXXXXXXXX], [P], s[INITIALIZING], indexUUID [nR_XXXXXXX],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[default-index][3]
failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[default-index][3]
shard allocated for local recovery (post api), should exist, but doesn't, current files:
[_222.si, _2222.fdx, 1111.fnm, ----- list of all files ]];
nested: FileNotFoundException[No such file [_3yxri.si]]; ]]
I can curl the host and get a response, but the shards are getting marked as UNASSIGNED. Only shard 3 is complaining; the remaining shards look OK. Shard 3 is distributed across host3 and host4, which are continuously logging excessively, so we are keeping them down.
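For reference, shard-level cluster health (a stock API in the 1.x line; localhost:9200 is assumed) gives a quick per-shard view of the same INITIALIZING/UNASSIGNED states:

$ curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty'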
Our setup is as follows:
4 hosts, each running one ES node
5 shards
$ curl --fail -XGET 'http://localhost:9200/_cat/shards?pretty=true'
index 4 p STARTED 4336176 14.8gb IP1 host1_9300
index 4 r STARTED 4336176 14.8gb IP4 host4_9300
index 0 p STARTED 4336256 14gb IP1 host1_9300
index 0 r STARTED 4336256 14gb IP4 host4_9300
index 3 p INITIALIZING IP3 host3_9300
index 3 r UNASSIGNED
index 1 r STARTED 4340540 14.5gb IP2 host2_9300
index 1 p STARTED 4340540 14.5gb IP1 host1_9300
index 2 r STARTED 4333466 15.1gb IP2 host2_9300
index 2 p STARTED 4333466 15.1gb IP4 host4_9300
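While shard 3 is stuck in INITIALIZING, the cat recovery API (also available in 1.3.x) shows whether its recovery is progressing or repeatedly failing; something along these lines (the index name matches the one in the logs):

$ curl -XGET 'http://localhost:9200/_cat/recovery?v' | grep default-index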
Hi Mark - the log I attached in the description is the content of the log from host3, and it is complaining about a FileNotFoundException.
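For context, the output below comes from running Lucene's CheckIndex tool directly against the shard's index directory. On ES 1.3.4 (which ships Lucene 4.9.x) the invocation is roughly as follows; the jar path and exact Lucene version are assumptions:

# run read-only first to see what CheckIndex reports
java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.1.jar \
  org.apache.lucene.index.CheckIndex \
  /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/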
Opening index @ /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/
ERROR: could not read any segments file in directory
java.nio.file.NoSuchFileException: /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/_3XXXX.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:49)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
We checked, and the file it is complaining about is not available in either the primary or the replica shard. It looks like we lost some data while syncing between prod and DR. We are OK with some data loss if we can get the shard started somehow, for the following reason:
The shard corruption is on the disaster recovery (DR) box. Once the shard changes from UNASSIGNED to STARTED, we will sync the snapshots from prod to the DR box and then do a restore.
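Since some data loss is acceptable here, two hacks commonly used on 1.x-era clusters may apply. This is only a sketch: the jar path, node name, and repository/snapshot names are assumptions, not values from this cluster.

# 1) With ES stopped on host3, drop the unreadable segment(s) from the shard.
#    CheckIndex -fix permanently removes any segment it cannot open, which
#    means losing the documents in those segments.
java -cp /usr/share/elasticsearch/lib/lucene-core-4.9.1.jar \
  org.apache.lucene.index.CheckIndex \
  /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/ -fix

# 2) After restarting the node, if the shard is still UNASSIGNED, force
#    allocation via the 1.x reroute API. allow_primary can discard whatever
#    data remains on that shard - acceptable only because a restore from the
#    prod snapshot is planned right after.
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [ { "allocate": {
      "index": "default-index", "shard": 3,
      "node": "host3_9300", "allow_primary": true } } ]
}'

# 3) Once the shard reports STARTED, restore from the prod snapshot as planned
#    (repository and snapshot names below are placeholders).
curl -XPOST 'http://localhost:9200/_snapshot/prod_repo/snap_1/_restore'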