Failed to load metata when restarting elasticsearch

When I restart a node of the Elasticsearch cluster, it failed to start. I set log level to trace. The logs is like below:

[2022-01-05T19:2...][TRACE][o.e.g.PersistedClusterStateService] [HOSTNAME] found index metadata for [index1_name]
[2022-01-05T19:2...][TRACE][o.e.g.PersistedClusterStateService] [HOSTNAME] processing doc 440
[2022-01-05T19:2...][TRACE][o.e.b.Bootstrap ] [HOSTNAME] Exception
org.Elasticsearch.ElasticsearchExecption: failed t load metadata
at org.Elasticsearch.gateway.GatewayMetaState.start(GatewayMetadataSate.java:187) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.node.Node.start(Node.java:786) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:317) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:402) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.cli.EnvironmentAwareCommand.Execute(EnvironmentAwareCommand.java:86) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127).Eeecute(EnvironmentAwareCommand.java:86) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.cli.Command.main(Command.java:90).Eeecute(EnvironmentAwareCommand.java:86) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~ [Elasticsearch-7.9.2.jar:7.9.2]
at org.Elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~ [Elasticsearch-7.9.2.jar:7.9.2]
Caused by: com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x7b) -- rather, it starts with '{' (plain JSON input?) -- can not parse
at [Source: (byte)"error": {"root_cause": [{"type": "sql_illegal_argument_exception", "reason": "The default limit [65535] for aggerate sorting has been reached; please specify a LIMIT"}], "type": }.... debugA2s?index.indexing.slowlog.threshold.index.infoA5s?index.indexing.slowlog.threshold.index.traceD500ms?index.......index.number_of_replicas"[truncated 1339 bytes]; line: -1, column: 0]
at com.fasterxml.jackson.dataformat.smile.SmileParseBootstrapper.constructParse(SmileParserBootstrapper.java:133) ~[jackson-dataformat-smile-2.10.4.jar:2.10.4]
...
...
...

At last, I backed up the Elasticsearch data path, and mkdir a new empty data path, then Elasticsearch is started.
Does anyone know the reason of the restart failure, and how to avoid it? Thanks.
It seems the error occurs when parse plain json data into smile format.

Looks like the bug fixed by Fix DefaultRestChannel Corrupting Shared Buffers on Serialialization Issues by original-brownbear · Pull Request #72274 · elastic/elasticsearch · GitHub - you should upgrade to pick up the fix.

Thanks for your kind reply.
David, I don't understand about the bug PR #72274 fixed, Would you explain what impact the bug has, and why you think my restart failure is caused by the bug? Do you know how to reproduce the restart failure?

Writing the cluster state uses a shared buffer which in your case appears to have been corrupted by a REST response. The bug I linked explains exactly how that can happen in your version. It's a very rare race condition but yes it is possible to reproduce it.

:grinning: Thank you, Sorry David.
I don't understand all of your conversatioin in the PR page.
Is there a full steps by which I can reproduce the bug? I want reproduce the issue.

Now I have found the corrupted file $(path.data)/nodes/0/_state/_10up.fdt with the command below.

grep 'sql_illegal_argument_exception' -r $(path.data)/nodes/0

In my view, It's this file that stop the Elasticsearch node to be start. After read the PR #72274 conversation, I got some of why the issue happened, although not clear about the detail. It seems because of the bug #72274 fixed, the response json "{"root_cause": [{"type": "sql_illegal_argument_exception.." which should only be sent as a query response was writed to the file $(path.data)/nodes/0/_state/_10up.fdt by mistake. Then it failed to started when we restart Elasticsearch node, because the Elasticsearch can't not parse the file $(path.data)/nodes/0/_state/_10up.fdt correctly as the error message says below.

Caused by: com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x7b) -- rather, it starts with '{' (plain JSON input?) -- can not parse
at [Source: (byte)"error": {"root_cause": [{"type": "sql_illegal_argument_exception", "reason": "The default limit [65535] for aggerate sorting has been reached; please specify a LIMIT"}], "type": }.... debugA2s?index.indexing.slowlog.threshold.index.infoA5s?index.indexing.slowlog.threshold.index.traceD500ms?index.......index.number_of_replicas"[truncated 1339 bytes]; line: -1, column: 0]

It's not easy to reproduce race conditions like the one you hit here, and I don't have the time to help you do so unfortunately. Maybe someone else will step in to help you but all I can do is recommend again that you upgrade to pick up the fix.

Thank you again. I will upgrade the elasticseatch cluster.