CorruptIndexException missing .si file

Our cluster has 3 nodes, about 10K primary shards, and replica = 1.
The cluster had been working normally until one node showed this error message:

[root@server2014 ~]# uncaught exception in thread [main]
ElasticsearchException[failed to bind service]; nested: CorruptIndexException[Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/elasticsearch/polaris/nodes/0/_state/segments_3on4")))]; nested: NoSuchFileException[/home/elasticsearch/polaris/nodes/0/_state/];
Likely root cause: java.nio.file.NoSuchFileException: /home/elasticsearch/polaris/nodes/0/_state/
	at java.base/sun.nio.fs.UnixException.translateToIOException(
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(
	at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(

That node cannot be started again; the error message keeps appearing, and the missing file really cannot be found. We don't have any snapshots, so we lost part of our data.

So I would like to know: what could be the root cause of this situation? Our Elasticsearch version is 7.6. And how can we prevent this problem in the future? I'm thinking about increasing the replica count and using snapshots. Thank you in advance.
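For anyone considering the snapshot route mentioned above, this is a minimal sketch of the two requests involved in Elasticsearch 7.x: registering a shared-filesystem snapshot repository and taking a snapshot. The repository name `my_backup`, the path `/mnt/es_backups/my_backup`, and the snapshot name are assumptions for illustration; the location must be under a path listed in `path.repo` in `elasticsearch.yml`. The bodies are shown as plain dicts rather than live HTTP calls:

```python
import json

# PUT _snapshot/my_backup -- register a shared-filesystem repository.
# "my_backup" and the location path are assumed names for illustration;
# the location must be under a path.repo entry in elasticsearch.yml.
register_repo = {
    "type": "fs",
    "settings": {"location": "/mnt/es_backups/my_backup"},
}

# PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
# -- snapshot all indices plus the cluster state.
create_snapshot = {
    "indices": "*",
    "include_global_state": True,
}

print(json.dumps(register_repo))
print(json.dumps(create_snapshot))
```

Raising the replica count helps against losing a single node's copy, but only a snapshot protects you when the cluster itself loses data.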

It sounds like you have far too many shards in your cluster. It may not be directly related to your error message, but it is likely to be causing problems.
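To put the numbers in the original post in perspective, here is a rough back-of-the-envelope check. The 30 GB heap per node is an assumed figure, and the "at most ~20 shards per GB of heap" rule of thumb is the sizing guideline Elastic published for the 7.x era:

```python
# Rough shard-count sanity check for the cluster described above.
# heap_gb_per_node is an assumption; "20 shards per GB of heap" is the
# 7.x-era guideline from Elastic's sizing guidance.
primaries = 10_000
replicas = 1
nodes = 3
heap_gb_per_node = 30  # assumed

total_shards = primaries * (1 + replicas)  # 20,000 shard copies
shards_per_node = total_shards / nodes     # ~6,667 per node
recommended_max = 20 * heap_gb_per_node    # ~600 per node

print(f"{shards_per_node:.0f} shards/node vs recommended <= {recommended_max}")
```

That is more than ten times the recommended density, which makes every cluster-state update and recovery far more expensive than it needs to be.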

Either the file was deleted by an external force after Elasticsearch wrote it, or else your storage is misconfigured to ignore fsync calls and there was a power outage.

Can you clarify that a little more? If a power outage was the cause, what would have happened to delete the *.si file? And why would we need to, and how could we, "configure to ignore fsync calls"? I don't have much time to work with Elasticsearch, so I may be asking very basic questions. I'd appreciate any guidance.

In this case, it's less that the file was deleted and more that it was never actually written in the first place. These things happened in order:

  1. Elasticsearch wrote a file under nodes/0/_state/ and called fsync() to ensure that this write was durable (i.e. that it will persist across a power outage).
  2. The disk acknowledged the fsync() to confirm that the write was indeed durable.
  3. Elasticsearch wrote nodes/0/_state/segments_3on4 (which refers to the file written in step 1).
  4. The disk completed a durable write of nodes/0/_state/segments_3on4.

Step 2 is where this often falls down if your system is misconfigured: the disk claims to have durably written the file without actually having done so. This is often the default behaviour, since durable writes can be slow and you get better performance numbers by lying like this. If the write wasn't really durable and there's then a power outage, when the node restarts it finds that the file simply isn't there: it was never really written.
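The write-then-fsync sequence in the steps above is roughly the following pattern. This is a minimal Python sketch of a durable write on a POSIX filesystem (using a throwaway temporary directory, not the Elasticsearch data path): fsync on the file flushes its contents, and a second fsync on the containing directory makes the new directory entry itself durable. If the storage layer acknowledges either fsync without really flushing, a power cut can still lose the file:

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data to path so it survives a power outage (on honest storage)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)            # flush the file contents to disk
    finally:
        os.close(fd)
    # The new directory entry must be flushed too, or the file name
    # itself can vanish after a crash even though its data was written.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "segments_demo")
    durable_write(target, b"segment metadata")
```

If the disk lies at either fsync, the pattern above silently loses its guarantee, which is exactly the "misconfigured to ignore fsync calls" failure described here.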

This isn't really anything to do with Elasticsearch - it has to assume that you have configured everything for durable writes, and if that's not the case then you'll lose data.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.