SOLVED - ELASTICSEARCH - Unable to start Elasticsearch service on Linux

It all started when we terminated the over-running snapshot restore process. After this, when we tried to start Elasticsearch, we got the error below. Could someone please help? It's Elasticsearch 1.3.4 on Linux.

We are not in a position to do a fresh install, so any small hacks to get around the problem would be helpful.

Details from the log - some data masked:

[2016-03-29 08:24:12,543][WARN ][cluster.action.shard ]
[instance_9300] [default-index][3] received shard failed for [default-index][3],
node[XXXXXXXXX], [P], s[INITIALIZING], indexUUID [nR_XXXXXXX],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[default-index][3]
failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[default-index][3]
shard allocated for local recovery (post api), should exist, but doesn't, current files:
[_222.si, _2222.fdx, 1111.fnm, ----- list of all files ]];
nested: FileNotFoundException[No such file [_3yxri.si]]; ]]

Based on that log it looks like ES has started.
Can you not curl the host on 9200?
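(i.e. something along these lines, assuming the default HTTP port:)
curl -XGET 'http://localhost:9200/'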

Mark - thanks for your response.

I can curl the host and get a response, but the shards are being marked as UNASSIGNED. Only shard 3 is complaining; the remaining shards look OK. Shard 3 is distributed on host 3 and host 4, which are continuously logging excessively, so we are keeping them down.

Our set-up is as below:

  1. 4 hosts in total, each running one ES node
  2. 5 shards

$curl --fail -XGET 'http://localhost:9200/_cat/shards?pretty=true'
index 4 p STARTED      4336176 14.8gb IP1 host1_9300
index 4 r STARTED      4336176 14.8gb IP4 host4_9300
index 0 p STARTED      4336256 14gb   IP1 host1_9300
index 0 r STARTED      4336256 14gb   IP4 host4_9300
index 3 p INITIALIZING                IP3 host3_9300
index 3 r UNASSIGNED
index 1 r STARTED      4340540 14.5gb IP2 host2_9300
index 1 p STARTED      4340540 14.5gb IP1 host1_9300
index 2 r STARTED      4333466 15.1gb IP2 host2_9300
index 2 p STARTED      4333466 15.1gb IP4 host4_9300
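
A quick filter to isolate just the problem shards, assuming the same host and port as above:
curl -s 'http://localhost:9200/_cat/shards' | grep -E 'INITIALIZING|UNASSIGNED'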

Then ES has started :slight_smile:

You will probably need to look at the logs on host3 and see what is happening.

Hi Mark - the log I attached in the description is from host3, and it is complaining about a FileNotFoundException:

[2016-03-29 08:24:12,543][WARN ][cluster.action.shard ]
[instance_9300] [default-index][3] received shard failed for [default-index][3],
node[XXXXXXXXX], [P], s[INITIALIZING], indexUUID [nR_XXXXXXX],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[default-index][3]
failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[default-index][3]
shard allocated for local recovery (post api), should exist, but doesn't, current files:
[_222.si, _2222.fdx, 1111.fnm, ----- list of all files ]];
nested: FileNotFoundException[No such file [_3yxri.si]]; ]]

Ouch, you might need to shut down that node so that the replica can take its place.

Then - upgrade! Versions prior to 1.5 have known corruption issues.
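
(On 1.x you can stop the node via its service script or, if I remember right, the nodes shutdown API that was removed in 2.0 - a sketch, with host3 standing in for that node's HTTP address:)
curl -XPOST 'http://host3:9200/_cluster/nodes/_local/_shutdown'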

Hi Mark - in our situation the corruption is on the disaster recovery box, and the primary Elasticsearch cluster is still up and running.

The solution we are looking for is to fix the current issue and resolve the UNASSIGNED shards. An Elasticsearch upgrade is planned for the future.

Any help resolving the UNASSIGNED shards issue would be great.

Can anyone suggest a solution for this? We need to get around the UNASSIGNED shard issue.

One of our shards is corrupted - is there any way to re-create just that shard?

$java -cp elasticsearch-1.3.4/lib/lucene-core-4.9.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $shard_d -verbose

Opening index @ /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/

ERROR: could not read any segments file in directory
java.nio.file.NoSuchFileException: /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/_3XXXX.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:49)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
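
For reference, CheckIndex also has a -fix mode that drops segments it cannot read, but it never gets that far here because the segments metadata itself points at the missing .si file; the invocation would look roughly like this, with the same classpath and path as above:
java -cp elasticsearch-1.3.4/lib/lucene-core-4.9.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/ -fix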

Check the other node that has the replica, does that actually have data in the shard directory?

We checked, and the data it is complaining about is not available in either the primary or the replica shard - it looks like we lost some data while syncing between prod & DR. We are OK with some data loss if we can get the shard started somehow, for the reason below:

  1. The shard corruption is on the disaster recovery box, and once the shard changes from UNASSIGNED to STARTED we will sync the snapshots from prod to the DR box and then do a restore (sketched below).
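
For completeness, the restore step would be something like the following once the shard is healthy again; prod_backup and snapshot_1 are placeholder names for our repository and snapshot, and the index normally has to be closed or deleted before restoring over it:
curl -XPOST 'http://localhost:9200/_snapshot/prod_backup/snapshot_1/_restore?wait_for_completion=true'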

I managed to fix the corrupted shard issue by deleting just that shard.

With the reroute below, all the data in that shard directory for the index is lost (allow_primary allocates an empty primary shard), and the snapshot is then restored from the primary site.

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands": [{"allocate": {"index": "", "shard": 3, "node": "node_name", "allow_primary": true }}]}'
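
In case anyone repeats this: the index name was masked in the command above, and the node value is the node name as reported by the cat nodes API (assuming the defaults):
curl -XGET 'http://localhost:9200/_cat/nodes?v'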

can have a peaceful weekend :relaxed:

Some commands, in case you need them for troubleshooting:
curl 'http://localhost:9200/_cat/segments?v'
curl -XGET 'http://localhost:9200/_recovery?pretty=true'
curl -XGET 'http://localhost:9200/index_name/_recovery?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/health?level=indices&pretty'
curl -XGET 'localhost:9200/_cat/recovery?v'
curl -XGET 'http://localhost:9200/_cat/shards?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

That's really the only way of fixing it, given that you lost the data.

I'd strongly recommend upgrading as previously mentioned.