Dealing with node failures

dd_d · April 13, 2018, 8:33am

My cluster config:
ES version: 6.2.3
Number of nodes: 3 (all master-eligible and data)
Number of replicas: 1

I ran a test that sent concurrent search requests to the ES cluster through the Java REST client API (3 hosts, with the sniffer), then killed one node.
Since there are replicas on the other two nodes, I expected the requests to succeed.
But some requests failed partially, even though the response status was 200:

{
  "took": 558,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 5,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 5,
        "index": "v1",
        "reason": {
          "type": "node_disconnected_exception",
          "reason": "~~[indices:data/read/search[phase/fetch/id]] disconnected"
        }
      }
    ]
  }
}
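
Since the HTTP status is 200 in this case, the client has to look at `_shards.failed` in the body to notice the partial failure. Below is a minimal sketch of such a check. The class and method names (`ShardFailureCheck`, `failedShards`, `isPartial`) are hypothetical, and the regex is only a stand-in for a proper JSON parser so that the example has no dependencies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any Elasticsearch client library.
final class ShardFailureCheck {
    // Matches the "failed" counter that sits directly inside the "_shards" object.
    // Assumption: the counters appear before any nested object, as in the response above.
    private static final Pattern FAILED = Pattern.compile(
        "\"_shards\"\\s*:\\s*\\{[^{}]*?\"failed\"\\s*:\\s*(\\d+)");

    // Returns the number of failed shards reported in a search response body.
    static int failedShards(String body) {
        Matcher m = FAILED.matcher(body);
        if (!m.find()) {
            throw new IllegalArgumentException("no _shards.failed field in response body");
        }
        return Integer.parseInt(m.group(1));
    }

    // True when at least one shard failed, even if the HTTP status was 200.
    static boolean isPartial(String body) {
        return failedShards(body) > 0;
    }
}
```

In real code a JSON library would be the safer way to read the field; the point is only that the decision is made on the body, not on the status code.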

So my questions are:

1. Is there any way to get a successful response (without shard failures)?
2. If I have to retry in these cases, how long does it take before a retry succeeds?
3. When I retried right after the shard failures, all requests succeeded. Can I always get a successful response with a single retry after a node_disconnected_exception failure?
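
To make the retry question concrete, here is a rough sketch of the client-side retry I have in mind. Everything in it is an assumption for illustration: `PartialFailureRetry` and `searchWithRetry` are hypothetical names, the search call is passed in as a plain `Supplier`, and the attempt count and backoff are arbitrary values. It does not show that one retry is always enough.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper, not part of any Elasticsearch client library.
final class PartialFailureRetry {
    // Runs the search up to maxAttempts times and returns the first response
    // that isComplete accepts. If every attempt is partial, the last response
    // is returned so the caller can decide what to do with it.
    static <T> T searchWithRetry(Supplier<T> search, Predicate<T> isComplete,
                                 int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        T last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = search.get();
            if (isComplete.test(last)) {
                return last;
            }
            if (attempt < maxAttempts) {
                try {
                    // Linear backoff: wait a little longer before each new attempt.
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return last;
                }
            }
        }
        return last;
    }
}
```

A caller would pass the real search call as `search` and a check on the number of failed shards as `isComplete`.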

Thanks.

Related topics:
Should Elasticsearch return a non-200 response if there are shard failures? #18978

github.com/elastic/elasticsearch
Opened by nik9000 at 02:58PM on 20 Jun 16 UTC; closed at 04:07PM on 20 Jun 16 UTC
Labels: discuss, :Core/Infra/REST API, v5.0.0-alpha4

Lots of Elasticsearch tasks are forked onto a bunch of shards. When those shards… fail, Elasticsearch returns the failures in the json at `_shards.failures` including the HTTP response code that that failure deserves _but_ it will return an HTTP 200 response code so long as a single shard succeeds. I think it should return a non-200 HTTP status if any shard fails. Do we want to:

1. Continue as we are and return a 200 code if any shard succeeds.
2. Return a non-200 code if any shards fail. We'd return the highest numbered failure because that is what we do now if _all_ shards fail. You can get each response code in the `failures` array in the response. We could also talk about [RFC 4918](https://tools.ietf.org/html/rfc4918)'s `207 Multi-Status` but at first glance that specifies some XML response we aren't going to implement.
3. Add a boolean to the request to toggle between the two behaviors. We'd have to pick a default but we could default to the old behavior for 5.0 if we didn't want the change to be breaking.

How do people typically handle shard failures in their results?

