Failed to load metata when restarting elasticsearch

bruce2 · January 7, 2022, 7:21am

When I restart a node of the elasticsearch cluster, it failed to start. I set log level to trace. The logs is like below:

[2022-01-05T19:2...][TRACE][o.e.g.PersistedClusterStateService] [HOSTNAME] found index metadata for [index1_name]
[2022-01-05T19:2...][TRACE][o.e.g.PersistedClusterStateService] [HOSTNAME] processing doc 440
[2022-01-05T19:2...][TRACE][o.e.b.Bootstrap ] [HOSTNAME] Exception
org.elasticsearch.ElasticsearchExecption: failed t load metadata
at org.elasticsearch.gateway.GatewayMetaState.start(GatewayMetadataSate.java:187) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.node.Node.start(Node.java:786) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:317) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:402) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.cli.EnvironmentAwareCommand.Execute(EnvironmentAwareCommand.java:86) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127).Eeecute(EnvironmentAwareCommand.java:86) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.cli.Command.main(Command.java:90).Eeecute(EnvironmentAwareCommand.java:86) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~ [elasticsearch-7.9.2.jar:7.9.2]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~ [elasticsearch-7.9.2.jar:7.9.2]
Caused by: com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x7b) -- rather, it starts with '{' (plain JSON input?) -- can not parse
at [Source: (byte)"error": {"root_cause": [{"type": "sql_illegal_argument_exception", "reason": "The default limit [65535] for aggerate sorting has been reached; please specify a LIMIT"}], "type": }.... debugA2s?index.indexing.slowlog.threshold.index.infoA5s?index.indexing.slowlog.threshold.index.traceD500ms?index.......index.number_of_replicas"[truncated 1339 bytes]; line: -1, column: 0]
at com.fasterxml.jackson.dataformat.smile.SmileParseBootstrapper.constructParse(SmileParserBootstrapper.java:133) ~[jackson-dataformat-smile-2.10.4.jar:2.10.4]
...
...
...

At last, I backed up the elasticsearch data path, and mkdir a new empty data path, then elasticsearch is started.
Does anyone know the reason of the restart failure, and how to avoid it? Thanks.
It seems the error occurs when parse plain json data into smile format.

DavidTurner · January 7, 2022, 12:03pm

Looks like the bug fixed by Fix DefaultRestChannel Corrupting Shared Buffers on Serialialization Issues by original-brownbear · Pull Request #72274 · elastic/elasticsearch · GitHub - you should upgrade to pick up the fix.

bruce2 · January 10, 2022, 3:17am

Thanks for your kind reply.
David, I don't understand about the bug PR #72274 fixed, Would you explain what impact the bug has, and why you think my restart failure is caused by the bug? Do you know how to reproduce the restart failure?

DavidTurner · January 10, 2022, 7:51am

Writing the cluster state uses a shared buffer which in your case appears to have been corrupted by a REST response. The bug I linked explains exactly how that can happen in your version. It's a very rare race condition but yes it is possible to reproduce it.

bruce2 · January 14, 2022, 6:43am

Thank you, Sorry David.
I don't understand all of your conversatioin in the PR page.
Is there a full steps by which I can reproduce the bug? I want reproduce the issue.

bruce2 · January 14, 2022, 7:02am

Now I have found the corrupted file $(path.data)/nodes/0/_state/_10up.fdt with the command below.

grep 'sql_illegal_argument_exception' -r $(path.data)/nodes/0

In my view, It's this file that stop the elasticsearch node to be start. After read the PR #72274 conversation, I got some of why the issue happened, although not clear about the detail. It seems because of the bug #72274 fixed, the response json "{"root_cause": [{"type": "sql_illegal_argument_exception.." which should only be sent as a query response was writed to the file $(path.data)/nodes/0/_state/_10up.fdt by mistake. Then it failed to started when we restart elasticsearch node, because the elasticsearch can't not parse the file $(path.data)/nodes/0/_state/_10up.fdt correctly as the error message says below.

Caused by: com.fasterxml.jackson.core.JsonParseException: Input does not start with Smile format header (first byte = 0x7b) -- rather, it starts with '{' (plain JSON input?) -- can not parse
at [Source: (byte)"error": {"root_cause": [{"type": "sql_illegal_argument_exception", "reason": "The default limit [65535] for aggerate sorting has been reached; please specify a LIMIT"}], "type": }.... debugA2s?index.indexing.slowlog.threshold.index.infoA5s?index.indexing.slowlog.threshold.index.traceD500ms?index.......index.number_of_replicas"[truncated 1339 bytes]; line: -1, column: 0]

DavidTurner · January 14, 2022, 7:46am

It's not easy to reproduce race conditions like the one you hit here, and I don't have the time to help you do so unfortunately. Maybe someone else will step in to help you but all I can do is recommend again that you upgrade to pick up the fix.

bruce2 · January 14, 2022, 8:23am

Thank you again. I will upgrade the elasticseatch cluster.

system · February 11, 2022, 8:23am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Not able to start elasticsearch Elasticsearch	2	1095	October 26, 2020
My ela cluster failed to load metadata when running, but I can't see the other problem. If there is the same problem, please help to take a look Elasticsearch	9	1057	March 27, 2023
Elasticsearch can not started：failed to find global metadata Elasticsearch	10	753	November 16, 2021
Failed to find metadata for existing index Elasticsearch	4	4206	May 30, 2019
Failed to find metadata for existing index after node restart Elasticsearch	6	5757	September 30, 2019

Failed to load metata when restarting elasticsearch

Related topics