SOLVED - ELASTICSEARCH - Unable to start Elasticsearch service on Linux

It all started when we terminated the over-running snapshot restore process. After this, when we tried to start Elasticsearch, we got the error below. Could someone please help? It's Elasticsearch 1.3.4 on Linux.

We are not in a position to do a fresh install, so any small hacks to get around the problem would be helpful.

Details from the log - some data masked:

[2016-03-29 08:24:12,543][WARN ][cluster.action.shard ]
[instance_9300] [default-index][3] received shard failed for [default-index][3],
node[XXXXXXXXX], [P], s[INITIALIZING], indexUUID [nR_XXXXXXX],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[default-index][3]
failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[default-index][3]
shard allocated for local recovery (post api), should exist, but doesn't, current files:
[_222.si, _2222.fdx, 1111.fnm, ----- list of all files ]];
nested: FileNotFoundException[No such file [_3yxri.si]]; ]]

Based on that log it looks like ES has started.
Can you not curl the host on 9200?
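(i.e. something along these lines, assuming the default HTTP port:)
curl -XGET 'http://localhost:9200/'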

Mark - thanks for your response.

I can curl the host and get a response, but the shards are being marked as UNASSIGNED. Only shard 3 is complaining; the remaining shards look OK. Shard 3 is distributed on host 3 and host 4, which are continuously logging excessively, so we are keeping them down.

Our set-up is as below:

  1. 4 hosts in total, each running one ES node
  2. 5 shards

$curl --fail -XGET 'http://localhost:9200/_cat/shards?pretty=true'
index 4 p STARTED      4336176 14.8gb IP1 host1_9300
index 4 r STARTED      4336176 14.8gb IP4 host4_9300
index 0 p STARTED      4336256 14gb   IP1 host1_9300
index 0 r STARTED      4336256 14gb   IP4 host4_9300
index 3 p INITIALIZING                IP3 host3_9300
index 3 r UNASSIGNED
index 1 r STARTED      4340540 14.5gb IP2 host2_9300
index 1 p STARTED      4340540 14.5gb IP1 host1_9300
index 2 r STARTED      4333466 15.1gb IP2 host2_9300
index 2 p STARTED      4333466 15.1gb IP4 host4_9300
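
A quick filter to isolate just the problem shards, assuming the same host and port as above:
curl -s 'http://localhost:9200/_cat/shards' | grep -E 'INITIALIZING|UNASSIGNED'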

Then ES has started :slight_smile:

You will probably need to look at the logs on host3 and see what is happening.

Hi Mark - the log I attached in the description is from host3, and it is complaining about a FileNotFoundException:

[2016-03-29 08:24:12,543][WARN ][cluster.action.shard ]
[instance_9300] [default-index][3] received shard failed for [default-index][3],
node[XXXXXXXXX], [P], s[INITIALIZING], indexUUID [nR_XXXXXXX],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[default-index][3]
failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[default-index][3]
shard allocated for local recovery (post api), should exist, but doesn't, current files:
[_222.si, _2222.fdx, 1111.fnm, ----- list of all files ]];
nested: FileNotFoundException[No such file [_3yxri.si]]; ]]

Ouch, you might need to shut down that node so that the replica can take its place.

Then - upgrade! Versions prior to 1.5 have known corruption issues.
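
(On 1.x you can stop the node via its service script or, if I remember right, the nodes shutdown API that was removed in 2.0 - a sketch, with host3 standing in for that node's HTTP address:)
curl -XPOST 'http://host3:9200/_cluster/nodes/_local/_shutdown'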

Hi Mark - in our situation the corruption is on the disaster recovery box, and the primary Elasticsearch cluster is still up and running.

The solution we are looking for is to fix the current issue and resolve the UNASSIGNED shards. An Elasticsearch upgrade is planned for the future.

Any help resolving the UNASSIGNED shards issue would be great.

Can anyone suggest a solution for this? We need to get around the UNASSIGNED shard issue.

One of our shards is corrupted - is there any way to re-create just that shard?

$java -cp elasticsearch-1.3.4/lib/lucene-core-4.9.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $shard_d -verbose

Opening index @ /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/

ERROR: could not read any segments file in directory
java.nio.file.NoSuchFileException: /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/_3XXXX.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:49)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
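
For reference, CheckIndex also has a -fix mode that drops segments it cannot read, but it never gets that far here because the segments metadata itself points at the missing .si file; the invocation would look roughly like this, with the same classpath and path as above:
java -cp elasticsearch-1.3.4/lib/lucene-core-4.9.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /data/host_9300/es_nameXXX/nodes/0/indices/default-index/3/index/ -fix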

Check the other node that has the replica, does that actually have data in the shard directory?

We checked, and the data it is complaining about is not available in either the primary or the replica shard - it looks like we lost some data while syncing between prod & DR. We are OK with some data loss if we can get the shard started somehow, for the reason below:

  1. The shard corruption is on the disaster recovery box, and once the shard changes from UNASSIGNED to STARTED we will sync the snapshots from prod to the DR box and then do a restore (sketched below).
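
For completeness, the restore step would be something like the following once the shard is healthy again; prod_backup and snapshot_1 are placeholder names for our repository and snapshot, and the index normally has to be closed or deleted before restoring over it:
curl -XPOST 'http://localhost:9200/_snapshot/prod_backup/snapshot_1/_restore?wait_for_completion=true'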

I managed to fix the corrupted shard issue by deleting just that shard.

With the reroute below, all the data in that shard directory for the index is lost (allow_primary allocates an empty primary shard), and the snapshot is then restored from the primary site.

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands": [{"allocate": {"index": "", "shard": 3, "node": "node_name", "allow_primary": true }}]}'
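
In case anyone repeats this: the index name was masked in the command above, and the node value is the node name as reported by the cat nodes API (assuming the defaults):
curl -XGET 'http://localhost:9200/_cat/nodes?v'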

can have a peaceful weekend :relaxed:

Some commands, in case you need them for troubleshooting:
curl 'http://localhost:9200/_cat/segments?v'
curl -XGET 'http://localhost:9200/_recovery?pretty=true'
curl -XGET 'http://localhost:9200/index_name/_recovery?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/health?level=indices&pretty'
curl -XGET 'localhost:9200/_cat/recovery?v'
curl -XGET 'http://localhost:9200/_cat/shards?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

That's really the only way of fixing it, given that you lost the data.

I'd strongly recommend upgrading as previously mentioned.