Elasticsearch service won't start, nested elasticsearch node state exception

Hello,

I noticed one of my Elasticsearch nodes was down and started digging into the issue. I found the service won't start and noticed a couple of key errors:

ElasticsearchException[failed to bind service]; nested: CorruptIndexException[Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/_state/segments_2ewxw")))]; nested: NoSuchFileException[/var/lib/elasticsearch/nodes/0/_state/_25kuo.si];
Likely root cause: java.nio.file.NoSuchFileException: /var/lib/elasticsearch/nodes/0/_state/_25kuo.si

Indeed, that file is not present. What are my options for recovering this node?
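For anyone hitting a similar error, here is a minimal sketch of pulling the missing path out of the exception text and checking whether it exists on disk. The function name `missing_paths` is made up for illustration, and the regex only assumes the bracketed `NoSuchFileException[...]` form shown in the log line above.

```python
import re
from pathlib import Path

# Matches the path inside "NoSuchFileException[<path>]" as it appears in the log line.
_MISSING = re.compile(r"NoSuchFileException\[([^\]]+)\]")


def missing_paths(log_text):
    """Return the distinct paths named in NoSuchFileException entries, in order."""
    seen = []
    for path in _MISSING.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


if __name__ == "__main__":
    # Excerpt of the error quoted above.
    excerpt = (
        "nested: NoSuchFileException"
        "[/var/lib/elasticsearch/nodes/0/_state/_25kuo.si];"
    )
    for path in missing_paths(excerpt):
        state = "present" if Path(path).exists() else "missing"
        print(path, "->", state)
```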

Okay, after many more searches I was able to find the thread "Recovering from missing state .si file".

DavidTurner, helping once again, suggests that the contents of /var/lib/elasticsearch can be cleared without worry, since any relevant data will be stored on other nodes of the cluster. I did this and re-ran my configuration management agent, and the node was able to rejoin the cluster.
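One thing worth checking after a node rejoins like this is the cluster health API. The sketch below is only illustrative: the address `http://localhost:9200` is an assumption (no authentication or TLS), and `health_problems` is a made-up helper that looks at the `status` and `unassigned_shards` fields of the `_cluster/health` response.

```python
import json
from urllib.request import urlopen


def health_problems(health):
    """Return human-readable warnings for a parsed _cluster/health response."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append("cluster status is %s, not green" % status)
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append("%d shard(s) still unassigned" % unassigned)
    return problems


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health; base_url is an assumed, unauthenticated endpoint."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)
```

Calling `health_problems(fetch_health())` on any node should return an empty list once the rebuilt node has finished recovering its shards from the rest of the cluster.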

Is there anything else I should look out for while I'm here? Otherwise I'm guessing this post will just stale out.

I'm also wondering if there's any way I can discover the root cause of this file going missing. I said I cleared out the contents, but what I actually ran was mv /var/lib/elasticsearch/nodes /var/lib/elasticsearch/nodes.old, so I should be able to analyze any of the files there. I'm not immediately aware of any tools that would let me discover the root cause, though. I'm going to check with my backups guy to see if he has any record of the file. Assuming these names iterate in alphabetical order, I could probably assume that it hadn't been formed yet, as there were no o's, but several p's, n's, and m's.

Anything that could help me discover the root cause would be really helpful, as I should be able to build some monitoring or CM to properly care for the directory contents. Thanks!

o comes before p so I don't think that follows :slight_smile:

The two likely explanations are (a) something other than Elasticsearch removed this file or (b) you had a power outage while Elasticsearch was writing the node state and your storage system performed some operations in the wrong order just before the outage. In either case you should be worried.

The solution is not to try and monitor the contents of the data path: it's best to consider it as being entirely under Elasticsearch's control. But it's definitely worth getting to the bottom of this if you can.

Ah, my bad. I'm spending too much time in the terminal today with no breaks :stuck_out_tongue: Here is the actual listing:

_25kuj.cfe
_25kuj.cfs
_25kuj.si
_25kuk.cfe
_25kuk.cfs
_25kuk.si
_25kul.cfe
_25kul.cfs
_25kul.si
_25kum.cfe
_25kum.cfs
_25kum.si
_25kun.cfe
_25kun.cfs
_25kun.si
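The listing above can be checked mechanically. This is a rough sketch that assumes the segment names are a counter written in base 36 after the leading underscore (which is how I understand Lucene names them); the helper names are made up. Under that assumption, `_25kuo` is the segment immediately after `_25kun`, the newest one present, and every older segment has its full set of `.cfe`, `.cfs`, and `.si` files.

```python
from collections import defaultdict


def segment_number(filename):
    """Decode the base-36 counter in a name like '_25kun.si' (assumed format)."""
    stem = filename.split(".", 1)[0]
    return int(stem.lstrip("_"), 36)


def summarize(filenames, missing):
    """Group files by segment and report how far `missing` is past the newest one."""
    by_segment = defaultdict(set)
    for name in filenames:
        stem, ext = name.split(".", 1)
        by_segment[stem].add(ext)
    newest = max(by_segment, key=segment_number)
    gap = segment_number(missing) - segment_number(newest)
    return newest, gap, dict(by_segment)


if __name__ == "__main__":
    # Rebuild the listing from the thread: segments j through n, three files each.
    listing = ["_25ku%s.%s" % (s, ext) for s in "jklmn" for ext in ("cfe", "cfs", "si")]
    newest, gap, by_segment = summarize(listing, "_25kuo.si")
    print("newest segment present:", newest)
    print("missing segment is", gap, "step(s) after it")
    incomplete = [s for s, exts in by_segment.items() if exts != {"cfe", "cfs", "si"}]
    print("segments with missing files:", incomplete)
```

If the base-36 assumption holds, a gap of one step would fit a write that was cut short while the newest segment was being created, rather than an older file being deleted later, though it does not rule out either explanation.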

Yes, I'm wondering whether the operation I ran, having the CM execute a POST to create a new user, somehow interrupted cluster operations, but that just seems so unlikely. I'm glad everything seems to be working though; I can breathe a bit easier going into the weekend.
