Can not restart Elasticsearch service

lcsb-sysadmins · May 30, 2023, 1:59pm

Hi,

Elastic Stack Version - 7.17.5

We had an issue with our server (iDRAC is unable to successfully communicate with the device RAID Controller) which caused disruption for our elastic stack.

We have since resolved that issue, but now I cannot restart the Elasticsearch service.
I get the following error:

[2023-05-30T15:16:42,500][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [hot-01-node-01] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: CorruptIndexException[file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:173) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:160) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.Command.main(Command.java:77) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:125) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80) ~[elasticsearch-7.17.5.jar:7.17.5]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:1088) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.node.Node.<init>(Node.java:309) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:169) ~[elasticsearch-7.17.5.jar:7.17.5]
        ... 6 more
Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))
        at org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:47:22]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]

Can you please help? How should I fix this issue?

Best Regards,

lcsb-sysadmins · May 30, 2023, 4:15pm

Also this is part of the log file

Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (80c5721d). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))
                at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:143) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:47:22]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:306) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:459) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:356) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.node.Node.<init>(Node.java:429) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.node.Node.<init>(Node.java:309) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:169) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:160) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.Command.main(Command.java:77) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:125) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80) ~[elasticsearch-7.17.5.jar:7.17.5]
        Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (eb286770). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/segments_waj")))
                at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:434) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:306) ~[elasticsearch-7.17.5.jar:7.17.5]

The file system mounted under /mnt/data/ on this server is the one that had issues, which is why Elasticsearch and other services were affected.

How can I get the service running again?

Christian_Dahlqvist · May 30, 2023, 4:27pm

If the index has been corrupted and there are no valid copies in the cluster you most likely need to restore the data from a snapshot.

lcsb-sysadmins · May 31, 2023, 8:59am

Thanks Christian! We do not have snapshots, but we have the indexes stored in raw files and can reindex them, although I am not sure how long that will take. Is there any other way to fix this, even if it means losing some data?

Christian_Dahlqvist · May 31, 2023, 9:27am

What is the output of the cat shards API?
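
For reference, the cat shards API is a plain HTTP GET on /_cat/shards. Below is a minimal Python sketch of such a request. The address http://localhost:9200 and the absence of TLS and authentication are assumptions made for illustration, not details from this cluster, and the call can only succeed while at least one node is up.

```python
import urllib.request
from urllib.parse import urlencode


def cat_shards_url(base_url: str = "http://localhost:9200") -> str:
    """Build the URL for GET /_cat/shards with column headers enabled (v=true)."""
    query = urlencode({"v": "true"})
    return f"{base_url.rstrip('/')}/_cat/shards?{query}"


def fetch_cat_shards(base_url: str = "http://localhost:9200", timeout: float = 10.0) -> str:
    """Return the plain-text shard table.

    Raises urllib.error.URLError if nothing is listening at base_url.
    The default address is an assumption; secured clusters also need
    https and credentials, which this sketch does not handle.
    """
    with urllib.request.urlopen(cat_shards_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling print(fetch_cat_shards()) on a host where a node is running would list one line per shard with its state and the node holding it.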

lcsb-sysadmins · May 31, 2023, 10:00am

I cannot get any output, since Elasticsearch is not running.

Christian_Dahlqvist · May 31, 2023, 10:02am

Then I suspect you will have lost the data and need to recover by reindexing as you do not have any snapshot.

lcsb-sysadmins · May 31, 2023, 10:37am

Do you suggest removing everything we have under /mnt/data/ and restarting Elasticsearch? We have 4 nodes in total in the cluster; the one with the issue is the master node, and its data appears to be corrupted. If we remove the data under /mnt/data/ on all the nodes of the cluster, will we be able to start fresh with new data? We have the raw files, and we need to get Elasticsearch up again even if that means losing the existing data.

lcsb-sysadmins · May 31, 2023, 10:52am

Will I run into an issue like the one in https://discuss.elastic.co/t/i-lose-all-my-data-when-master-node-restarts/334410/3?

Rios · May 31, 2023, 11:00am

/mnt/data/? Have you mounted the same storage disk on all nodes? Maybe I have misunderstood.
Every node should have its own independent disk for data.

lcsb-sysadmins · May 31, 2023, 11:04am

Hey Rios,

No, each node has its own storage. The data is under /mnt/data/hot-01-node-01, /mnt/data/hot-02-node-01, etc. on the respective nodes.

Rios · May 31, 2023, 11:10am

OK, good. Do you have enough disk space? Sometimes this can be a problem.
As Christian said, something is wrong with the index/indices.
Can you check the file /mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si:

CorruptIndexException[file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte

It looks like the naming is not OK. Please check on all servers.
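
To make it easier to compare nodes, the expected ID, the ID actually found, and the file path can be pulled out of the log line. The following Python sketch does that with a regular expression written against the message format shown in this thread. As far as I understand Lucene, the expected ID is the one recorded in the commit file (segments_N) and the other is read from the header of the .si file, but treat that reading as an assumption.

```python
import re

# Matches Lucene's "file mismatch" message in the form it has in the log above.
MISMATCH_RE = re.compile(
    r'file mismatch, expected id=(?P<expected>\w+), got=(?P<got>\w+)'
    r'.*?path="(?P<path>[^"]+)"'
)


def parse_mismatch(log_line: str):
    """Return a dict with 'expected', 'got' and 'path', or None if no match."""
    match = MISMATCH_RE.search(log_line)
    return match.groupdict() if match else None


# The message quoted in this thread, joined onto one line.
line = (
    'CorruptIndexException[file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, '
    'got=q1ylgofavl9fj2m1nayjlmte (resource=BufferedChecksumIndexInput('
    'NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))]'
)
info = parse_mismatch(line)
print(info)
```

Here the two IDs differ only in their last three characters (lbc versus mte), and the path points at the node's _state directory rather than at a shard of a user index.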

lcsb-sysadmins · May 31, 2023, 12:41pm

Yes, space is not an issue; the disk is only 20% full. I checked, and a file named _s4e.si exists on hot-01 but not on the other three nodes.

-rw-r--r-- 1 elasticsearch elasticsearch     402 May 28 03:44 _s4e.si

Also, since we had to repair the mount point on hot-01 (master), there is now a lost+found directory there that contains some data as well, but I am not sure whether it will be of any help.

lcsb-sysadmins · May 31, 2023, 3:32pm

If I remove the respective directories under /mnt/data/ on all the Elasticsearch nodes of the cluster, will I be able to start with new data after restarting the services? Could these steps break anything further from this point?

Like in posts 255376 and 229744.

Rios · June 1, 2023, 6:31am

The data directory is independent of the other settings, so in general you can leave the /mnt/data/ directory as it is, create a new directory /mnt/data2/, set path.data: /mnt/data2/ in elasticsearch.yml, and start ES without any data. That is like a fresh installation, but with your existing ES settings such as the log path, SSL if you used it, etc. Since you have enough disk space, do not touch the /mnt/data/ directory.
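
The path.data change described above can be sketched as a small text rewrite. This Python sketch only handles the flat "path.data: ..." form of the setting; the sample cluster name and log path in the usage lines are hypothetical, and the per-node subdirectory under /mnt/data2/ is an assumption modelled on the layout mentioned earlier in the thread.

```python
def set_data_path(yaml_text: str, new_path: str) -> str:
    """Return the config text with path.data pointing at new_path.

    Only the flat `path.data: ...` form is handled. A nested `path:` block
    or a list of data paths is not recognised and should be edited by hand.
    Commented-out lines such as `#path.data: ...` are left untouched.
    """
    lines = yaml_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("path.data:"):
            lines[i] = f"path.data: {new_path}"
            replaced = True
    if not replaced:
        # No active setting found, so append one at the end.
        lines.append(f"path.data: {new_path}")
    return "\n".join(lines) + "\n"


# Hypothetical config content, for illustration only.
old_config = (
    "cluster.name: demo\n"
    "path.data: /mnt/data/hot-01-node-01\n"
    "path.logs: /var/log/elasticsearch\n"
)
new_config = set_data_path(old_config, "/mnt/data2/hot-01-node-01")
print(new_config)
```

The new directory has to exist and be writable by the user that runs Elasticsearch before the service is started.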

There is a tool, CheckIndex, for shard issues. How ES storage works is explained here. There is also a topic about a similar issue, where an index is corrupted but ES is still up.

I haven't tried copying a single subdirectory from /mnt/data/ to /mnt/data2/ and then starting ES, but you have 4 nodes and I don't know how many replicas you use. You could try it on a single index. Check the index sizes with the command: du -h /mnt/data/ | sort -rh | head -10
For the test, pick a small index whose purpose you know. Maybe it is just a bad idea.
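
If a script is more convenient than the du pipeline, a rough Python equivalent is sketched below. It differs from du in two ways: it sums apparent file sizes rather than allocated disk blocks, and it ranks only the direct subdirectories of the given directory. Pointing it at the indices directory of a node (something like /mnt/data/hot-01-node-01/nodes/0/indices) is an assumption about the on-disk layout, not something confirmed in this thread.

```python
import os


def dir_size(path: str) -> int:
    """Sum the apparent sizes (in bytes) of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_subdirs(parent: str, top: int = 10) -> list:
    """Return up to `top` (size_in_bytes, name) pairs, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]
```

The entries at the end of the returned list are the smallest ones, which are the natural candidates for a first copy test.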

Lucene is responsible for writing and maintaining the Lucene index files while Elasticsearch writes metadata related to features on top of Lucene, such as field mappings, index settings and other cluster metadata

The best option is to wait for Christian or other ELK team members to suggest possible ways to save some of the data.

lcsb-sysadmins · June 1, 2023, 12:43pm

Also, what would be the way to make the other three nodes rejoin the cluster?
I am asking because trial and error could cause irrecoverable errors.

What steps are needed to make the other nodes join the new master? If all the data is gone, then the UUID will change for the master node, so the other nodes in the cluster will not be able to communicate with it and will refuse to join the new cluster (i.e. the new UUID).

Or can the master node (which is foobar) rejoin the cluster as the new master? Is that possible?

Also, I tried the Lucene CheckIndex tool, and it mostly reports that the issue comes from the segments file and the segment info file, i.e. segments_N and _s4e.si in this case.
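
For anyone repeating this check, a read-only CheckIndex run can be scripted roughly as follows. This is a sketch under assumptions: the location of the lucene-core jar is hypothetical (package installs usually keep it in the lib directory of the Elasticsearch installation), and other Lucene jars may be needed on the classpath depending on which codecs the index uses. The sketch deliberately does not pass -exorcise, the CheckIndex option that removes unreadable segments and therefore loses their documents.

```python
import subprocess


def checkindex_command(lucene_core_jar: str, index_dir: str, verbose: bool = False) -> list:
    """Build a read-only CheckIndex command line for a Lucene index directory."""
    cmd = ["java", "-cp", lucene_core_jar, "org.apache.lucene.index.CheckIndex", index_dir]
    if verbose:
        cmd.append("-verbose")
    return cmd


def run_checkindex(lucene_core_jar: str, index_dir: str) -> subprocess.CompletedProcess:
    """Run CheckIndex and capture its report; requires java on the PATH."""
    return subprocess.run(
        checkindex_command(lucene_core_jar, index_dir),
        capture_output=True,
        text=True,
    )


# Jar location is an assumption; the index directory is the one from the log above.
print(checkindex_command(
    "/usr/share/elasticsearch/lib/lucene-core-8.11.1.jar",
    "/mnt/data/hot-01-node-01/nodes/0/_state",
))
```

Elasticsearch should be stopped on the node while the tool runs, and working on a copy of the directory avoids touching the original files.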

Christian_Dahlqvist · June 1, 2023, 12:46pm

You can not make nodes that have been part of a cluster join a new one unless you wipe the data and let them join as new empty nodes. I would recommend setting up a new cluster and restoring the data from a snapshot.

lcsb-sysadmins · June 1, 2023, 12:56pm

Is there any way the segment files can be fixed, like the indexes, using the check index tool?

Christian_Dahlqvist · June 1, 2023, 12:57pm

I do not know.

system · June 29, 2023, 12:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
