Purpose of State files

Hello Everyone,

After searching everywhere, I couldn't find the purpose of state files.

We started on version 2.2, moved to 5.6, and are currently on 6.8.

Below is the scenario we frequently run into:
Elasticsearch continuously restarts due to state files either being empty or corrupted.
We do not know why the state files end up empty or corrupted.

What's interesting is, we end up deleting the state files and Elasticsearch gets initialized with no issues.
So, if the state files can be deleted, why does ES need them?
Are we losing any information by deleting those state files?
If ES finds empty or corrupted state files, why doesn't it repair them itself, depending on the individual scenario?
What does ES do with the data present in a state file?

I also noticed that when ES runs into an exception related to state files, it reports only one file at a time.
What I mean is that once I delete a corrupted state file, ES re-initializes and then throws an error about another state file.
Why doesn't it report all the corrupted state files at once?
Is there any way we can find out all the corrupted/empty state files?

Appreciate your time and help.

Regards,
Hari Yadavalli


Welcome to our community! :smiley:

It'd be useful if you could share logs from the nodes with these issues and the master node.
Also, what is the output from the _cluster/stats?pretty&human API?
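
For reference, a minimal sketch of pulling that output with Python's standard library. The base URL `http://localhost:9200` and the helper names are assumptions for illustration, and a cluster with security enabled would also need credentials:

```python
import json
import urllib.request


def cluster_stats_url(base_url="http://localhost:9200"):
    """Build the URL for the cluster stats API (base_url is an assumed default)."""
    return base_url.rstrip("/") + "/_cluster/stats?pretty&human"


def fetch_cluster_stats(base_url="http://localhost:9200", timeout=10):
    """Call the cluster stats API and return the parsed JSON response."""
    with urllib.request.urlopen(cluster_stats_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_cluster_stats()` against a reachable node returns a dictionary that can be pasted into a reply.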

Thank you Mark for your response.

I'm not looking for a solution per se; I'm trying to understand more about these state files and their significance.
I can provide the logs, but that would turn this into troubleshooting a single issue, which I'm not after at the moment.

Ex: org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:5, legacy:false, file:/opt/panlogs/ld2/esdata/pan_cluster/nodes/0/indices/HKTrCnmXREi1bP8qfMVZ6w/_state/state-5.st]

When I do a hex dump of the state file that ES complains about, it is either empty or corrupted.
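
To illustrate what finding them all at once could look like, here is a rough sketch that walks a data directory and flags every `*.st` file under a `_state` directory that is empty or fails a basic integrity check. It assumes the files use Lucene's usual codec framing (a 4-byte header magic, then a 16-byte footer holding a magic number, an algorithm id, and a CRC32 of everything before the checksum). That layout is my assumption, not something confirmed in this thread, and the function names are made up for the example.

```python
import os
import struct
import zlib

CODEC_MAGIC = 0x3FD76C17                  # Lucene codec header magic (assumed layout)
FOOTER_MAGIC = ~CODEC_MAGIC & 0xFFFFFFFF  # Lucene footer magic (assumed layout)
FOOTER_LEN = 16                           # magic (4) + algorithm id (4) + CRC32 as long (8)


def check_state_file(path):
    """Return 'ok', 'empty', or a short description of what looks wrong."""
    with open(path, "rb") as f:
        data = f.read()  # state files are small, so reading them whole is fine
    if not data:
        return "empty"
    if len(data) < 4 + FOOTER_LEN:
        return "truncated"
    if struct.unpack(">I", data[:4])[0] != CODEC_MAGIC:
        return "bad header"
    magic, algo, stored_crc = struct.unpack(">IIQ", data[-FOOTER_LEN:])
    if magic != FOOTER_MAGIC or algo != 0:
        return "bad footer"
    # The checksum covers every byte except the 8-byte checksum itself.
    if zlib.crc32(data[:-8]) != stored_crc:
        return "checksum mismatch"
    return "ok"


def scan(data_path):
    """Yield (path, status) for every suspicious *.st file under a _state directory."""
    for root, _dirs, files in os.walk(data_path):
        if os.path.basename(root) != "_state":
            continue
        for name in sorted(files):
            if name.endswith(".st"):
                full = os.path.join(root, name)
                status = check_state_file(full)
                if status != "ok":
                    yield full, status
```

Running `for path, status in scan("/opt/panlogs/ld2/esdata"): print(status, path)` with the node stopped would list every suspect file in one pass. It only reads the files and changes nothing.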

Why is the file getting corrupted? Do you have some external process interfering with the files, e.g. some kind of anti-virus software? Do you have issues with your storage? Do you have any non-standard plugins installed? This should not be happening in a normal cluster as far as I know.

Files like /opt/panlogs/ld2/esdata/pan_cluster/nodes/0/indices/HKTrCnmXREi1bP8qfMVZ6w/_state/state-5.st contain the index metadata, which includes the settings, mappings and so on that are vital for understanding the contents of the index. Sometimes they're duplicated across nodes so Elasticsearch can recover if one goes missing, but this isn't guaranteed. If Elasticsearch can't recover one then the index becomes completely unusable: you certainly won't be able to search it and you might not even be able to delete it. I believe there are also circumstances under which a missing index metadata file will prevent the whole node from starting up, rendering all the other data that it holds unreadable too.
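
The directory name in that path is the index UUID rather than the index name. As a small sketch of tying a failing state file back to an index, the helpers below extract the UUID from the path and look it up in the output of `_cat/indices?h=index,uuid&format=json`. The helper names and the `http://localhost:9200` default are assumptions for illustration, and the lookup only works while some node in the cluster can still answer requests.

```python
import json
import urllib.request
from pathlib import PurePosixPath


def index_uuid_from_path(state_path):
    """Return the path component that follows 'indices' (the index UUID)."""
    parts = PurePosixPath(state_path).parts
    return parts[parts.index("indices") + 1]


def names_by_uuid(rows):
    """Turn _cat/indices rows (dicts with 'index' and 'uuid') into a UUID -> name map."""
    return {row["uuid"]: row["index"] for row in rows}


def fetch_index_rows(base_url="http://localhost:9200", timeout=10):
    """Fetch index names and UUIDs as JSON (base_url is an assumed default)."""
    url = base_url.rstrip("/") + "/_cat/indices?h=index,uuid&format=json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `names_by_uuid(fetch_index_rows()).get(index_uuid_from_path(path))` gives the index name for a state file path, or `None` if the cluster no longer knows that UUID.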

Potentially yes, you are losing information by deleting those state files.

Elasticsearch doesn't try to fix this by itself because states like this indicate that something is desperately wrong with the system on which Elasticsearch is running. It might be recoverable this time, but it's certainly a leading indicator that something unrecoverable will happen in the future.

As for reporting only one file at a time: even one corrupted file indicates a broken system. There is no point in optimising for the case that the system is broken in multiple ways. The expectation is that you replace or fix the system on the first failure.


Thank you David for your response.

I did further tests around this state file.
Topology: one node in the cluster
I picked one index and deleted only that index's state files; I did not touch the global-state file or the node state file.
I then restarted Elasticsearch.
After ES came up, I noticed that a new state file had been created for this index. (I had made sure no state file for this index was present anywhere.)

Now, I'm curious how this could have happened.

Regards,
Hari Yadavalli
