Cannot restart Elasticsearch service

Hi,

Elastic Stack Version - 7.17.5

We had an issue with our server (iDRAC is unable to successfully communicate with the device RAID Controller) which caused disruption for our elastic stack.

We have since resolved that issue, but now I cannot restart the Elasticsearch service.
I get the following error:

[2023-05-30T15:16:42,500][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [hot-01-node-01] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: CorruptIndexException[file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:173) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:160) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
        at org.elasticsearch.cli.Command.main(Command.java:77) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:125) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80) ~[elasticsearch-7.17.5.jar:7.17.5]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:1088) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.node.Node.<init>(Node.java:309) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434) ~[elasticsearch-7.17.5.jar:7.17.5]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:169) ~[elasticsearch-7.17.5.jar:7.17.5]
        ... 6 more
Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))
        at org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:95) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:47:22]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]

Can you please help? How should I fix this issue?

Best Regards,

Also, this is part of the log file:

Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (80c5721d). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))
                at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:143) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:47:22]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:306) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:459) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:356) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.node.Node.<init>(Node.java:429) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.node.Node.<init>(Node.java:309) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:434) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:169) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:160) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
                at org.elasticsearch.cli.Command.main(Command.java:77) ~[elasticsearch-cli-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:125) ~[elasticsearch-7.17.5.jar:7.17.5]
                at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80) ~[elasticsearch-7.17.5.jar:7.17.5]
        Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (eb286770). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/data/hot-01-node-01/nodes/0/_state/segments_waj")))
                at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:434) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:720) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
                at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:306) ~[elasticsearch-7.17.5.jar:7.17.5]

The affected mount point on the server is /mnt/data/. It had issues, and as a result Elasticsearch and other services were affected.

How can I get the service running again?

If the index has been corrupted and there are no valid copies in the cluster you most likely need to restore the data from a snapshot.


Thanks Christian! We do not have snapshots, but we have the data for the indexes stored in raw files and can reindex it, although I am not sure how long that will take. Is there any other way to fix this, even if it means losing some data?

What is the output of the cat shards API?

I cannot get any output since Elasticsearch is not running.

Then I suspect you have lost the data and will need to recover by reindexing, as you do not have any snapshots.

Do you suggest removing everything under /mnt/data/ and restarting Elasticsearch? We have 4 nodes in total in the cluster; the one with the issue is the master node, and its data appears to be corrupted. If we remove the data under /mnt/data/ on all nodes of the cluster, will we be able to start fresh with new data? We have the raw files, and we need to get Elasticsearch up again even if that means losing the indexed data.

Will I run into an issue like the one in https://discuss.elastic.co/t/i-lose-all-my-data-when-master-node-restarts/334410/3?

/mnt/data/? Have you mounted the same storage disk on all nodes? Maybe I have misunderstood.
Every node should have its own independent disk for data.

Hey Rios,

No, each node has its own storage. On every node it is under /mnt/data/, for example /mnt/data/hot-01-node-01, /mnt/data/hot-02-node-01, etc.

OK, good. Do you have enough disk space? Sometimes that can be a problem.
As Christian said, something is wrong with one or more indices.
Can you check the file /mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si:

CorruptIndexException[file mismatch, expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte

It looks like the ID in the file does not match what is expected. Please check this on all servers.
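To compare the two IDs without reading through the whole stack trace, a small script can pull them out of the exception message. This is only a sketch: the helper name `parse_file_mismatch` and the regular expression are assumptions made for this example, not part of Elasticsearch or Lucene. It only reads log text and does not touch or repair any file.

```python
import re

# Hypothetical helper: extracts the expected ID, the ID actually found,
# and the file path from a Lucene "file mismatch" message.
_MISMATCH = re.compile(
    r'file mismatch, expected id=(?P<expected>\w+), got=(?P<got>\w+)'
    r'.*?path="(?P<path>[^"]+)"'
)


def parse_file_mismatch(message):
    """Return a dict with 'expected', 'got' and 'path', or None if no match."""
    match = _MISMATCH.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Message copied from the error posted in this thread.
    line = (
        'CorruptIndexException[file mismatch, '
        'expected id=q1ylgofavl9fj2m1nayjllbc, got=q1ylgofavl9fj2m1nayjlmte '
        '(resource=BufferedChecksumIndexInput(NIOFSIndexInput('
        'path="/mnt/data/hot-01-node-01/nodes/0/_state/_s4e.si")))]'
    )
    info = parse_file_mismatch(line)
    print(info["path"])
    print("expected:", info["expected"])
    print("got:     ", info["got"])
```

Judging by the stack trace (SegmentInfos.readCommit calling Lucene86SegmentInfoFormat.read and then CodecUtil.checkIndexHeaderID), the expected ID is the one recorded in the segments_N commit file and the "got" ID is the one in the header of _s4e.si, so the two files no longer seem to belong to the same commit.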

Yes, space is not an issue; the disk is only 20% full. I checked, and a file named _s4e.si exists on hot-01 but not on the other three nodes.

-rw-r--r-- 1 elasticsearch elasticsearch     402 May 28 03:44 _s4e.si

Since we had to repair the mount point on hot-01 (the master), there is now a lost+found directory there that contains some data as well, but I am not sure whether it will be of any help.

If I remove the respective directories under /mnt/data/ on all the Elasticsearch nodes of the cluster, will I be able to start with new data after restarting the services? Could these steps break things any further from this point?

Like in posts 255376 and 229744.

The data directory is independent of all other settings, so in general you can leave the /mnt/data/ directory as it is, create a new directory /mnt/data2/, set path.data: /mnt/data2/ in elasticsearch.yml, and start ES without any data. It will behave like a fresh installation but keep your existing ES settings, such as the log path, SSL if you use it, and so on. Since you have enough disk space, do not touch the /mnt/data/ directory.
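As a rough illustration of that config change, the sketch below rewrites the path.data line in the text of an elasticsearch.yml file and keeps the old value as a comment. The function name `set_path_data` and the sample settings are made up for this example, and it assumes the flat `path.data: ...` form of the key (the nested `path:` / `data:` form is not handled). It works on a string, so nothing is written to disk unless you save the result yourself.

```python
def set_path_data(yaml_text, new_path):
    """Return yaml_text with a flat 'path.data' key pointing at new_path.

    Any existing 'path.data:' line is kept as a comment so the old
    location stays visible. Already commented-out lines are left alone.
    """
    out = []
    replaced = False
    for line in yaml_text.splitlines():
        if line.strip().startswith("path.data:"):
            out.append("# " + line)
            if not replaced:
                out.append(f"path.data: {new_path}")
                replaced = True
        else:
            out.append(line)
    if not replaced:
        # No active path.data line found: append one at the end.
        out.append(f"path.data: {new_path}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical config content, for illustration only.
    sample = (
        "cluster.name: demo\n"
        "path.data: /mnt/data/hot-01-node-01\n"
        "path.logs: /var/log/elasticsearch\n"
    )
    print(set_path_data(sample, "/mnt/data2/"))
```

The new directory most likely also needs to be owned by the user that runs Elasticsearch before the service will start with it.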

There should be a tool, CheckIndex, for shard issues. How ES storage works is explained here. There is also a topic about a similar issue where an index is corrupted but ES is still up.

I haven't tried copying a single subdirectory from /mnt/data/ to /mnt/data2/ and then starting ES, and although you have 4 nodes, I don't know how many replicas you use. You could try it on a single index. Check index sizes with the command: du -h /mnt/data/ | sort -rh | head -10
Test with a small index whose purpose you know. Maybe this is just a bad idea.
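If du is not convenient, roughly the same size overview can be sketched in Python. The function names `dir_size` and `largest_subdirs` are made up for this example. Unlike du, it sums apparent file sizes rather than disk blocks and only ranks the immediate subdirectories of the base path. The `nodes/0/indices` location used below is the usual layout of a 7.x data path and is an assumption here, so adjust it to your setup; the directory names there are index UUIDs, not index names.

```python
import os


def dir_size(path):
    """Sum the apparent sizes (in bytes) of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Unreadable or vanished file: skip it rather than abort.
                continue
    return total


def largest_subdirs(base, top=10):
    """Return up to `top` (size_in_bytes, name) pairs, largest first."""
    sizes = []
    with os.scandir(base) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.name))
    sizes.sort(reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Assumed location of the per-index directories on this node.
    base = "/mnt/data/hot-01-node-01/nodes/0/indices"
    if os.path.isdir(base):
        for size, name in largest_subdirs(base):
            print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```

The script only reads file metadata, so it is safe to run against the old data directory.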

Lucene is responsible for writing and maintaining the Lucene index files, while Elasticsearch writes metadata related to features on top of Lucene, such as field mappings, index settings and other cluster metadata.

The best way is to wait for Christian or other ELK team members to suggest a possible way to save any data.

Also, how can I make the other three nodes rejoin the cluster?
I am asking because trial and error could cause irrecoverable errors.

What steps are needed to make the other nodes join the new master? If all the data is gone, the cluster UUID will change on the master node, so the other nodes will not be able to communicate with it and will refuse to join the new cluster (i.e. the one with the new UUID).

Or can the master node (which is foobar) rejoin the cluster as the new master? Is that possible?

Also, I tried the Lucene CheckIndex tool, and it mostly reports that the issue comes from the segments file and the segment info file, i.e. segments_N and _s4e.si in this case.

You cannot make nodes that have been part of a cluster join a new one unless you wipe the data and let them join as new empty nodes. I would recommend setting up a new cluster and restoring the data from a snapshot.

Is there any way the segment files can be fixed with the check index tool, as indexes can?

I do not know.
