When is an error definitely a permanent failure?

I am using the .NET Elasticsearch client v9.1.0. I want to detect when a response represents a guaranteed permanent failure, i.e. resending the same request is guaranteed to fail again (e.g. the request is malformed, the document doesn’t match the mapping, the payload is too large, etc.).

I have the following list of HTTP status codes so far:

  • 400 Bad Request
  • 404 Not Found
  • 405 Method Not Allowed
  • 409 Conflict
  • 413 Payload Too Large
  • 415 Unsupported Media Type

Are there any more? I’m also not sure about 400: is it guaranteed to be a permanent failure, or could a 400 also be returned for a transient problem like a timeout (and if so, is it possible to differentiate the two)?
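
For context, here's roughly how I plan to use this (a minimal sketch, assuming an existing ElasticsearchClient called client and a document to index; IsPermanentFailure is my own hypothetical helper, "my-index" is made up, and the code set is just my list above):

    // Hypothetical helper: true when resending the identical request is
    // expected to fail again. The set of codes is exactly the list above,
    // which is what I'm asking about - nothing here is guaranteed by the
    // client or the server.
    static bool IsPermanentFailure(int? statusCode) => statusCode switch
    {
        400 => true,  // Bad Request (but see my question about timeouts)
        404 => true,  // Not Found
        405 => true,  // Method Not Allowed
        409 => true,  // Conflict (is this right?)
        413 => true,  // Payload Too Large
        415 => true,  // Unsupported Media Type
        _   => false, // anything else: assume a retry might succeed
    };

    var response = await client.IndexAsync(document, i => i.Index("my-index"));
    if (!response.IsValidResponse &&
        IsPermanentFailure(response.ApiCallDetails.HttpStatusCode))
    {
        // Dead-letter the request instead of retrying it.
    }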

I’m not sure you can do this with HTTP status codes alone - the HTTP spec defines these codes in terms of HTTP semantics, which makes them a blunt instrument for application-level decisions like retryability.

Of the list you gave, I’d expect 409 Conflict to be worth retrying (maybe after resolving the conflict), but the others do sound permanent. 400 Bad Request should mean the request is genuinely bad - a transient problem like overload would usually surface as 429 Too Many Requests instead.

A 409 (Conflict) is guaranteed to fail until the reason for the conflict is resolved.

Logstash, for example, does not retry 409 errors; it logs a warning and drops the event.

What do you mean by “permanent failure”? A write with an error results in the data (or, with a bulk write, part of the data) not being stored in the DB. And a lot of the failures are data dependent: if your writer changes the data pattern, the failure goes away. (A mapping error is one such scenario, e.g. writing a string where the mapping expects an integer, or vice versa.)
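
To make that concrete, here is a minimal sketch (hypothetical index name and documents collection) of pulling per-item errors out of a bulk response with the .NET client - with bulk, the overall HTTP status can be 200 while individual items failed, so the item-level error type is what tells you whether a failure is data dependent:

    using System.Linq;

    var bulkResponse = await client.BulkAsync(b => b
        .Index("my-index")
        .IndexMany(documents));

    if (bulkResponse.Errors)
    {
        // e.g. Type == "mapper_parsing_exception" when a string is written
        // where the mapping expects an integer: the same document will fail
        // again, but a corrected document will go through.
        foreach (var item in bulkResponse.Items.Where(i => i.Error is not null))
        {
            Console.WriteLine($"{item.Id}: {item.Error!.Type} - {item.Error.Reason}");
        }
    }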

If you are talking about reads, failures are usually networking related: a temporary access issue, or the transition period after a hostname/DNS config change (still networking related).

The way we operate the cluster is to look into every error. There shouldn’t be any errors during writes (which is what we care about most). Read errors are often timeouts caused by some complicated aggregation; in that case, we investigate the application logic for a solution.

Well yeah, but it could be that the conflict was caused by two concurrent updates, in which case there’s a good chance that a retry will see no such conflict.
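
A minimal sketch of that case (the index name, id, and Product type are all made up): re-read the document and retry with fresh optimistic-concurrency values instead of treating the 409 as permanent:

    // Hypothetical document type for the example.
    public record Product(string Name, decimal Price);

    var get = await client.GetAsync<Product>("my-id", g => g.Index("products"));

    // Optimistic concurrency: this write only succeeds if nobody else has
    // updated the document since we read it.
    var update = await client.IndexAsync(new Product("widget", 9.99m), i => i
        .Index("products")
        .Id("my-id")
        .IfSeqNo(get.SeqNo!.Value)
        .IfPrimaryTerm(get.PrimaryTerm!.Value));

    if (update.ApiCallDetails.HttpStatusCode == 409)
    {
        // We lost the race against a concurrent update: re-read and retry.
        // The same logical operation has a good chance of succeeding now.
    }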

This is the way.

By permanent failure I mean that resending the exact same request would produce the same error. I was mainly unsure about 400 (Bad Request), which I thought only happened for genuinely bad requests, but Claude AI stated it could possibly also happen due to a timeout. So I guess that’s plainly wrong, and a 400 really means the request is non-retryable? Maybe it got confused by this thread: X-Pack - “Monitoring: Error 400 Bad Request: Request Timeout after 60000ms”

That’s the intention, yes, but to repeat: HTTP status codes are very coarse, and it’s hard to get this kind of nuance from the status code alone. It’s unclear to me what was actually happening in the 8-year-old thread you linked, but that was a long time ago.
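
If you want to tell the two apart programmatically, one option is to check whether you got a status code at all: a client-side timeout never produces a server status, so the transport records an exception instead. A minimal sketch, assuming the default non-throwing client configuration and a hypothetical "my-index":

    var response = await client.IndexAsync(document, i => i.Index("my-index"));
    var details = response.ApiCallDetails;

    if (details.HttpStatusCode is null && details.OriginalException is not null)
    {
        // No HTTP status at all: a transport-level failure (timeout, DNS,
        // connection refused, ...). Transient - worth retrying.
    }
    else if (details.HttpStatusCode == 400)
    {
        // A genuine 400 from the server: the request itself is malformed,
        // and resending it verbatim will fail the same way.
    }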