When is an error definitely a permanent failure?

I am using the .NET Elasticsearch client v9.1.0. I want to detect when a response represents a guaranteed permanent failure, i.e. resending the same request is guaranteed to fail again (e.g. the request is malformed, the document doesn’t match the mapping, the payload is too large, etc.).

I have the following list of HTTP status codes so far:

  • 400 Bad Request
  • 404 Not Found
  • 405 Method Not Allowed
  • 409 Conflict
  • 413 Payload Too Large
  • 415 Unsupported Media Type

Are there any more? I’m also not sure about 400: is it guaranteed to be a permanent failure, or could a 400 also be returned for a transient problem like a timeout (and if so, is it possible to differentiate the two)?
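
For context, here's roughly how I plan to use this (a minimal sketch, assuming an existing ElasticsearchClient called client and a document to index; IsPermanentFailure is my own hypothetical helper, "my-index" is made up, and the code set is just my list above):

    // Hypothetical helper: true when resending the identical request is
    // expected to fail again. The set of codes is exactly the list above,
    // which is what I'm asking about - nothing here is guaranteed by the
    // client or the server.
    static bool IsPermanentFailure(int? statusCode) => statusCode switch
    {
        400 => true,  // Bad Request (but see my question about timeouts)
        404 => true,  // Not Found
        405 => true,  // Method Not Allowed
        409 => true,  // Conflict (is this right?)
        413 => true,  // Payload Too Large
        415 => true,  // Unsupported Media Type
        _   => false, // anything else: assume a retry might succeed
    };

    var response = await client.IndexAsync(document, i => i.Index("my-index"));
    if (!response.IsValidResponse &&
        IsPermanentFailure(response.ApiCallDetails.HttpStatusCode))
    {
        // Dead-letter the request instead of retrying it.
    }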

I’m not sure you can do this with HTTP status codes alone - the HTTP spec defines these codes in terms of HTTP semantics, which makes them a blunt instrument for application-level decisions like retryability.

Of the list you gave, I’d expect 409 Conflict to be worth retrying (maybe after resolving the conflict), but the others do sound permanent. 400 Bad Request should mean the request is genuinely bad - a transient problem like overload would usually surface as 429 Too Many Requests instead.

A 409 (Conflict) is guaranteed to fail until the reason for the conflict is resolved.

Logstash, for example, does not retry 409 errors; it logs a warning and drops the event.

What do you mean by “permanent failure”? A write with an error results in the data (or, with a bulk write, part of the data) not being stored in the DB. And a lot of the failures are data dependent: if your writer changes the data pattern, the failure goes away. (A mapping error is one such scenario, e.g. writing a string where the mapping expects an integer, or vice versa.)
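
To make that concrete, here is a minimal sketch (hypothetical index name and documents collection) of pulling per-item errors out of a bulk response with the .NET client - with bulk, the overall HTTP status can be 200 while individual items failed, so the item-level error type is what tells you whether a failure is data dependent:

    using System.Linq;

    var bulkResponse = await client.BulkAsync(b => b
        .Index("my-index")
        .IndexMany(documents));

    if (bulkResponse.Errors)
    {
        // e.g. Type == "mapper_parsing_exception" when a string is written
        // where the mapping expects an integer: the same document will fail
        // again, but a corrected document will go through.
        foreach (var item in bulkResponse.Items.Where(i => i.Error is not null))
        {
            Console.WriteLine($"{item.Id}: {item.Error!.Type} - {item.Error.Reason}");
        }
    }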

If you are talking about reads, failures are usually networking related: a temporary access issue, or the transition period after a hostname/DNS config change (still networking related).

The way we operate the cluster is to look into every error. There shouldn’t be any errors during writes (which is what we care about most). Read errors are often timeouts caused by some complicated aggregation; in that case, we investigate the application logic for a solution.

Well yeah, but it could be that the conflict was caused by two concurrent updates, in which case there’s a good chance that a retry will see no such conflict.
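
A minimal sketch of that case (the index name, id, and Product type are all made up): re-read the document and retry with fresh optimistic-concurrency values instead of treating the 409 as permanent:

    // Hypothetical document type for the example.
    public record Product(string Name, decimal Price);

    var get = await client.GetAsync<Product>("my-id", g => g.Index("products"));

    // Optimistic concurrency: this write only succeeds if nobody else has
    // updated the document since we read it.
    var update = await client.IndexAsync(new Product("widget", 9.99m), i => i
        .Index("products")
        .Id("my-id")
        .IfSeqNo(get.SeqNo!.Value)
        .IfPrimaryTerm(get.PrimaryTerm!.Value));

    if (update.ApiCallDetails.HttpStatusCode == 409)
    {
        // We lost the race against a concurrent update: re-read and retry.
        // The same logical operation has a good chance of succeeding now.
    }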

This is the way.

By permanent failure I mean that resending the exact same request would produce the same error. I was mainly unsure about 400 (Bad Request), which I thought only happened for genuinely bad requests, but Claude AI stated it could possibly also happen due to a timeout. So I guess that’s plainly wrong, and a 400 really means the request is non-retryable? Maybe it got confused by this thread: X-Pack - “Monitoring: Error 400 Bad Request: Request Timeout after 60000ms”

That’s the intention, yes, but to repeat: HTTP status codes are very coarse, and it’s hard to get this kind of nuance from the status code alone. It’s unclear to me what was actually happening in the 8-year-old thread you linked, but that was a long time ago.
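
If you want to tell the two apart programmatically, one option is to check whether you got a status code at all: a client-side timeout never produces a server status, so the transport records an exception instead. A minimal sketch, assuming the default non-throwing client configuration and a hypothetical "my-index":

    var response = await client.IndexAsync(document, i => i.Index("my-index"));
    var details = response.ApiCallDetails;

    if (details.HttpStatusCode is null && details.OriginalException is not null)
    {
        // No HTTP status at all: a transport-level failure (timeout, DNS,
        // connection refused, ...). Transient - worth retrying.
    }
    else if (details.HttpStatusCode == 400)
    {
        // A genuine 400 from the server: the request itself is malformed,
        // and resending it verbatim will fail the same way.
    }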