Upgrade to 8.1.2 failing with "unexpected folder encountered during data folder upgrade"

Hello,

I upgraded 1 of the 2 data nodes in my 3-node cluster (the 3rd node is voting-only and does not store any data) from 7.17.2 to 8.1.2. After the upgrade, the Elasticsearch service is crashing with the following error:

Exception
java.lang.IllegalStateException: unexpected folder encountered during data folder upgrade: /mnt/ssd1/var/lib/elasticsearch/nodes/0/_state_30-05-2021

A few more lines from the log file:

[2022-04-20T04:48:42,732][INFO ][o.e.e.NodeEnvironment    ] [secondarynode] oldest index version recorded in NodeMetadata 7.8.1
[2022-04-20T04:48:42,733][ERROR][o.e.b.Bootstrap          ] [secondarynode] Exception
java.lang.IllegalStateException: unexpected folder encountered during data folder upgrade: /mnt/ssd1/var/lib/elasticsearch/nodes/0/_state_30-05-2021
        at org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:431) ~[elasticsearch-8.1.2.jar:8.1.2]

How do I resolve this error?

This is not a folder that Elasticsearch would create, so it looks like someone or something else has been meddling with the contents of the data path. This is strongly discouraged and can lead to all sorts of problems.

I would recommend restoring the cluster from a snapshot into a clean data path.
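For illustration only, a restore could look roughly like the sketch below; the repository and snapshot names are placeholders, and it assumes you already have a snapshot repository registered and are restoring into an empty cluster:

# Placeholder host, credentials, repository and snapshot names
curl -k -u elastic -X POST "https://primarynode:9200/_snapshot/my_repository/my_snapshot/_restore?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "*", "include_global_state": false }'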

Thank you very much @DavidTurner. It is odd, since the host is exclusively an ES node. Furthermore, Elasticsearch's data is written to an SSD separate from the OS disk, so /mnt/ should have no other data written to it, not even by the OS.

I do reckon this has something to do with last year's disk failure that you helped me diagnose in this thread (the month of the thread matches :expressionless: - Node sync fails and cluster goes to "red" - #21 by parthmaniar).

I will need to replace the SSD, meaning the new one will have no data (but the Elasticsearch settings and the OS will remain as is). Hence, is the following feasible for recovery?

  1. Attach a new SSD to the VM with the same mount path as the previous one.
  2. Upgrade the last remaining ES data node to 8.1.2. This node has all of the data.
  3. I have turned off shard allocation:
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
  4. Can I then re-enable shard allocation, which would fill up the new disk? Will this work? (See the sketch right after this list.)
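For step 4, this is the call I have in mind; it is only a sketch with a placeholder host and credentials, and setting the value to null should reset it to the default of "all":

# Placeholder host and user; resets cluster.routing.allocation.enable to its default
curl -k -u elastic -X PUT "https://primarynode:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": null } }'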

The reason I am stuck is:

  1. One data & master node is on 7.17.2 and it has all of the data intact.
  2. One data & master node is on 8.1.2, but its data disk (the ES data folder) has failed.
  3. The third node is voting-only and has been upgraded successfully.

Thank you very much.

Yes, if your cluster health is yellow then you can simply replace this node with a new (empty) 8.1.2 one and let Elasticsearch rebuild its contents. I recommend waiting until the health is green before upgrading the final node.
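If it helps, you can block until the cluster reports green with something like this (host and credentials are placeholders to adapt to your setup):

# Placeholder host/user; returns once health is green or after the 120s timeout
curl -k -u elastic "https://primarynode:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty"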

This is where I am confused.

My current cluster status:

  1. primarynode with node.roles: [ data, master ] is fully operational running 7.17.2
  2. secondarynode with node.roles: [ data, master ] has storage failure and has been upgraded to 8.1.2
  3. votingonlynode with node.roles: [ master, voting_only ] is upgraded to 8.1.2 & ES service is running

I am unable to query the nodes (API calls via Postman are giving the following output):

{
    "error": {
        "root_cause": [
            {
                "type": "security_exception",
                "reason": "unable to authenticate user [elastic] for REST request [/_cluster/health/]",
                "header": {
                    "WWW-Authenticate": [
                        "Bearer realm=\"security\"",
                        "ApiKey",
                        "Basic realm=\"security\" charset=\"UTF-8\""
                    ]
                }
            }
        ],

I am not sure why, but I get these when one of the two master nodes goes down.
Maybe it is because of:
discovery.seed_hosts: ["primarynode", "secondarynode", "votingonlynode"]
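Translated from Postman to curl, the call I am making is roughly this (host and scheme are assumptions on my side; -k is only there because of the self-signed certificate):

# Placeholder host; prompts for the elastic user's password
curl -k -u elastic "https://primarynode:9200/_cluster/health?pretty"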

Any guidance here?

Anything in the logs? Assuming your credentials are correct I guess this means the cluster health is not yellow. I suggest downgrading secondarynode back to 7.17.2 until you work out what's going on here. Downgrades typically don't work, but it should be ok here since the "unexpected folder encountered during data folder upgrade" failure happens so early in startup.

Here is the current status (I think the initial error was because the cluster was still searching for the secondary node):

{

    "cluster_name": "data_analytics_1",
    "status": "red",
    "timed_out": false,
    "number_of_nodes": 2,
    "number_of_data_nodes": 1,
    "active_primary_shards": 1177,
    "active_shards": 1177,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 876,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 57.3307355090112

}
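To see which indices those 876 unassigned shards belong to, a quick way to list them (host and credentials are placeholders):

# Placeholder host/user; keep only unassigned shards with their recorded reason
curl -k -u elastic "https://primarynode:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED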

I've got a new SSD and the VM is up and running. The secondarynode has Elasticsearch 8.1.2 running (I got a prompt for 8.1.3 - I reckon Elastic needs to rethink its release cycle :expressionless: )

"cluster_name": "data_analytics_1",
    "status": "red",
    "timed_out": false,
    "number_of_nodes": 3,
    "number_of_data_nodes": 2,
    "active_primary_shards": 1189,
    "active_shards": 1189,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 1195,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 49.874161073825505
}

The status seems to be red since I have not enabled routing of shards.

Should I upgrade the last node to 8.1.3 (this is the only node with data right now) and then enable shard routing to sync the primary (running 7.17.2) and secondary nodes? Or should I enable syncing before the upgrade so that there are two copies of the data?

I think you should have downgraded as per my previous message, at least until you work out what's going on. The cluster is in red health now so it seems you've lost some primary shards.
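You can also ask the cluster why a particular shard is unassigned via the allocation explain API; called without a body it picks the first unassigned shard it finds (host and credentials below are placeholders):

# Placeholder host/user; explains the first unassigned shard the cluster finds
curl -k -u elastic "https://primarynode:9200/_cluster/allocation/explain?pretty"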

Thanks David. I will start the rebuild process. I have taken snapshots of the data, including a full backup of the VMs, before attempting the quick and, of course, ignorant fix. Sorry for that and thank you very much. :slight_smile:
