[User error, please ignore] Can't upgrade from 7.17.19 to ES8. Docs indicate that you should be able to

I am upgrading from 7.17.19 to 8.11.3 specifically and got the following message:
Upgrading to [8.11.3] is only supported from version [7.17.0] (full error below)

[2024-06-28T19:16:18,650][ERROR][org.elasticsearch.bootstrap.Elasticsearch] [eng-test-es8-index] fatal exception while booting Elasticsearch
org.elasticsearch.ElasticsearchException: failed to bind service
    at org.elasticsearch.node.Node.<init>(Node.java:1230) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.node.Node.<init>(Node.java:344) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:236) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:236) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:73) ~[elasticsearch-8.11.3.jar:?]
Caused by: org.elasticsearch.gateway.CorruptStateException: Format version is not supported. Upgrading to [8.11.3] is only supported from version [7.17.0].
    at org.elasticsearch.env.NodeEnvironment.checkForIndexCompatibility(NodeEnvironment.java:518) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.env.NodeEnvironment.upgradeLegacyNodeFolders(NodeEnvironment.java:417) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:310) ~[elasticsearch-8.11.3.jar:?]
    at org.elasticsearch.node.Node.<init>(Node.java:499) ~[elasticsearch-8.11.3.jar:?]
    ... 4 more

The upgrade documentation says you must first upgrade to 7.17, but nowhere does it call out 7.17.0 specifically. And because you cannot downgrade, this puts customers in a tough situation if they've already upgraded to 7.17.1 or a higher patch version.

It seems like the documentation is wrong here.

EDIT: further inspection of the code makes it seem like this error message is misleading. It probably means to say "from version [7.17.0] and above"
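
To make the "and above" reading concrete, here is a minimal Python sketch of a minimum-source-version check. This is not the Elasticsearch implementation; the names `MIN_UPGRADE_SOURCE`, `parse_version` and `is_upgradeable_from` are made up for illustration, and the 7.17.0 floor is simply taken from the error message.

```python
from typing import Tuple

# Floor taken from the error message; the constant name is hypothetical.
MIN_UPGRADE_SOURCE = (7, 17, 0)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def is_upgradeable_from(source: str) -> bool:
    """Return True if `source` is at or above the minimum source version."""
    # Tuples compare element by element, so (7, 17, 19) >= (7, 17, 0).
    return parse_version(source) >= MIN_UPGRADE_SOURCE
```

Under this reading, `is_upgradeable_from("7.17.19")` is true and `is_upgradeable_from("7.16.3")` is false, so a 7.17.19 node should pass a check of this kind.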

Extra info: the error is reliably reproducible, and the v7 server comes up just fine and recognizes the indices/data, so I'm not sure why it's having trouble upgrading to v8.

Did you run the Upgrade Assistant first? What did it tell you?

It doesn't respond with anything useful: [error][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. Response Error. We already knew this from the Elasticsearch failure, since it fails after going down this code path, which indicates a problem getting the node/index metadata.

Full Kibana output below:

kibana-7.17.22-darwin-x86_64 [11:27:51] $ ./bin/kibana
Kibana is currently running with legacy OpenSSL providers enabled! For details and instructions on how to disable see https://www.elastic.co/guide/en/kibana/7.17/production.html#openssl-legacy-provider
  log   [11:27:59.041] [info][plugins-service] Plugin "metricsEntities" is disabled.
  log   [11:27:59.103] [info][server][Preboot][http] http server running at http://localhost:5601
  log   [11:27:59.143] [warning][config][deprecation] Starting in 8.0, the Kibana logging format will be changing. This may affect you if you are doing any special handling of your Kibana logs, such as ingesting logs into Elasticsearch for further analysis. If you are using the new logging configuration, you are already receiving logs in both old and new formats, and the old format will simply be going away. If you are not yet using the new logging configuration, the log format will change upon upgrade to 8.0. Beginning in 8.0, the format of JSON logs will be ECS-compatible JSON, and the default pattern log format will be configurable with our new logging system. Please refer to the documentation for more information about the new logging format.
  log   [11:27:59.144] [warning][config][deprecation] The default mechanism for Reporting privileges will work differently in future versions, which will affect the behavior of this cluster. Set "xpack.reporting.roles.enabled" to "false" to adopt the future behavior before upgrading.
  log   [11:27:59.145] [warning][config][deprecation] User sessions will automatically time out after 8 hours of inactivity starting in 8.0. Override this value to change the timeout.
  log   [11:27:59.146] [warning][config][deprecation] Users are automatically required to log in again after 30 days starting in 8.0. Override this value to change the timeout.
  log   [11:27:59.235] [info][plugins-system][standard] Setting up [113] plugins: [translations,licensing,globalSearch,globalSearchProviders,features,licenseApiGuard,code,usageCollection,xpackLegacy,taskManager,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,share,embeddable,uiActionsEnhanced,screenshotMode,banners,telemetry,newsfeed,mapsEms,mapsLegacy,kibanaLegacy,fieldFormats,expressions,dataViews,charts,esUiShared,bfetch,data,savedObjects,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,customIntegrations,home,searchprofiler,painlessLab,grokdebugger,management,watcher,licenseManagement,advancedSettings,spaces,security,savedObjectsTagging,reporting,canvas,lists,ingestPipelines,fileUpload,encryptedSavedObjects,dataEnhanced,cloud,snapshotRestore,eventLog,actions,alerting,triggersActionsUi,transform,stackAlerts,ruleRegistry,visualizations,visTypeXy,visTypeVislib,visTypeVega,visTypeTimelion,visTypeTagcloud,visTypeTable,visTypePie,visTypeMetric,visTypeMarkdown,tileMap,regionMap,expressionTagcloud,expressionMetricVis,console,graph,fleet,indexManagement,remoteClusters,crossClusterReplication,indexLifecycleManagement,dashboard,maps,dashboardMode,dashboardEnhanced,visualize,visTypeTimeseries,rollup,indexPatternFieldEditor,lens,cases,timelines,discover,osquery,observability,discoverEnhanced,dataVisualizer,ml,uptime,securitySolution,infra,upgradeAssistant,monitoring,logstash,enterpriseSearch,apm,savedObjectsManagement,indexPatternManagement]
  log   [11:27:59.248] [info][plugins][taskManager] TaskManager is identified by the Kibana UUID: 48895a9e-ba64-4d6a-b524-76e32403e375
  log   [11:27:59.321] [warning][config][plugins][security] Generating a random key for xpack.security.encryptionKey. To prevent sessions from being invalidated on restart, please set xpack.security.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.322] [warning][config][plugins][security] Session cookies will be transmitted over insecure connections. This is not recommended.
  log   [11:27:59.334] [warning][config][plugins][security] Generating a random key for xpack.security.encryptionKey. To prevent sessions from being invalidated on restart, please set xpack.security.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.335] [warning][config][plugins][security] Session cookies will be transmitted over insecure connections. This is not recommended.
  log   [11:27:59.346] [warning][config][plugins][reporting] Generating a random key for xpack.reporting.encryptionKey. To prevent sessions from being invalidated on restart, please set xpack.reporting.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.360] [warning][encryptedSavedObjects][plugins] Saved objects encryption key is not set. This will severely limit Kibana functionality. Please set xpack.encryptedSavedObjects.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.369] [warning][actions][plugins] APIs are disabled because the Encrypted Saved Objects plugin is missing encryption key. Please set xpack.encryptedSavedObjects.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.379] [warning][alerting][plugins] APIs are disabled because the Encrypted Saved Objects plugin is missing encryption key. Please set xpack.encryptedSavedObjects.encryptionKey in the kibana.yml or use the bin/kibana-encryption-keys command.
  log   [11:27:59.386] [info][plugins][ruleRegistry] Installing common resources shared between all indices
  log   [11:27:59.683] [info][config][plugins][reporting] Chromium sandbox provides an additional layer of protection, and is supported for Darwin OS. Automatically enabling Chromium sandbox.
  log   [11:27:59.823] [error][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. Response Error

Additional notes: I'm a first-time Kibana user. To connect, I ran kubectl port-forward pod/<my es pod> 9200 and set elasticsearch.hosts: ["http://localhost:9200"] in Kibana's ./config/kibana.yml

@dadoonet after some debugging, I found that at line 349 of PersistedClusterState#nodeMetadata, the indexPath is <data_dir>/nodes/0/_state. On the next line of code, Files.exists(indexPath) evaluates to false, and the path is indeed not on disk, so the node's version metadata cannot be determined. This is pretty bizarre, as I can see that this path does exist before the upgrade.
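
For anyone wanting to run the same on-disk check before starting the v8 node, here is a rough Python sketch. It only looks at the two locations mentioned in this thread (`nodes/0/_state` for the 7.x layout and a top-level `_state` after the 8.x move); the function name `find_state_dirs` and the dictionary keys are hypothetical, and it assumes a single node ordinal of 0.

```python
from pathlib import Path
from typing import Dict


def find_state_dirs(data_dir: str) -> Dict[str, bool]:
    """Report which node-metadata directories exist under `data_dir`.

    Path.resolve() follows symlinks, so a symlinked data path is
    checked at its real location.
    """
    root = Path(data_dir).resolve()
    return {
        # Layout described for the 7.x node in this thread.
        "legacy_7x": (root / "nodes" / "0" / "_state").is_dir(),
        # Layout after the 8.x data folder upgrade has moved things.
        "upgraded_8x": (root / "_state").is_dir(),
    }
```

If both values come back false on a node that has been running for a while, something outside Elasticsearch has probably removed the directory.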

Edit: actually, is this normal? Elasticsearch v8 rewrites the nodes directory structure and puts a _state directory at the top level of the data dir

I think this might be elastic/elasticsearch issue #109544, "Upgrade from 7.x to 8.x fails if 7.x node didn't fully start up". If the _state path doesn't exist then there's no data to upgrade, so just start a new node rather than trying to upgrade a basically-empty node that has ended up in an unexpected state.

"If you create a 7.17.x node, kill it at exactly the wrong time during startup"

At least this part of the issue doesn't match my issue. With my issue, the 7.17.X node has properly come up, started up, and has been in operation for a long time.

That's not to say the issues aren't related though...

Hmm, I see. No, this issue is only about freshly-started nodes. We do move stuff around in the data path during the upgrade, but I've never seen it get into this state (nor have any of the many tests of this area of the system).

@DavidTurner btw, our data dir is a symlink and I believe this ought to work after your fix in 8.3

I see, and does it reproduce if you upgrade a node that doesn't use a symlink for its data path?

FWIW an upgrade (from 7.17.15 to 8.14.1) works for me with a symlinked data path:

$ ls -adl /Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0
lrwxr-xr-x@ 1 davidturner  staff  31 30 Jun 20:52 /Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0 -> ../elasticsearch-7.17.15/data-0
$ cat /Users/davidturner/discuss/362229/elasticsearch-8.14.1/logs/elasticsearch.log  | grep -e 'NodeEnvironment'
[2024-06-30T20:53:02,018][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/System/Volumes/Data (/dev/disk3s5)]], net usable_space [127.7gb], net total_space [926.3gb], types [apfs]
[2024-06-30T20:53:02,018][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [1gb], compressed ordinary object pointers [true]
[2024-06-30T20:53:02,020][INFO ][o.e.e.NodeEnvironment    ] [node-0] upgrading legacy data folders: [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0]
[2024-06-30T20:53:02,046][INFO ][o.e.e.NodeEnvironment    ] [node-0] oldest index version recorded in NodeMetadata 7171599
[2024-06-30T20:53:02,047][INFO ][o.e.e.NodeEnvironment    ] [node-0] data folder upgrade: moved from [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/nodes/0/indices] to [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/indices]
[2024-06-30T20:53:02,047][INFO ][o.e.e.NodeEnvironment    ] [node-0] data folder upgrade: moved from [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/nodes/0/_state] to [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/_state]
[2024-06-30T20:53:02,048][INFO ][o.e.e.NodeEnvironment    ] [node-0] data folder upgrade: moved from [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/nodes/0/snapshot_cache] to [/Users/davidturner/discuss/362229/elasticsearch-8.14.1/data-0/snapshot_cache]
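
As a side note on the "oldest index version recorded in NodeMetadata 7171599" line: the number looks like a packed version id. Here is a small Python sketch that decodes it, assuming the id is built as major*1000000 + minor*10000 + revision*100 + build. That encoding is an assumption on my part rather than something stated in this thread, and `decode_version_id` is a hypothetical helper name.

```python
def decode_version_id(version_id: int) -> str:
    """Decode a packed version id into 'major.minor.revision'.

    Assumes id = major*1_000_000 + minor*10_000 + revision*100 + build.
    The trailing build component is ignored.
    """
    major = version_id // 1_000_000
    minor = (version_id // 10_000) % 100
    revision = (version_id // 100) % 100
    return f"{major}.{minor}.{revision}"
```

With that assumption, `decode_version_id(7171599)` gives `"7.17.15"`, which matches the 7.17.15 node being upgraded in the log above.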

Sorry, I don't have any real reason to suspect the symlink other than that it's the only unusual thing I think we do, but I think it's notable. I have been trying to repro this, but I haven't been able to in my local Docker environment. The symlinked data dir in my Docker environment with ES8 also comes up successfully.

But this 100% consistently repros in our cloud environments.

ugh, found some very old code tucked away in our cloud code in another repo that deleted the _state directory to force some rediscovery (don't ask me why, I didn't write it)

so that'd explain it. sorry for the false alarm, and thank you all for helping out and talking through it with me

Welp yeah that'd do it (and all sorts of other awful effects besides) :grin: Thanks for reporting back.