Retry logic not implemented for all HTTP calls between elasticsearch-hadoop and Elasticsearch

During chaos testing, we noticed that the elasticsearch-hadoop library failed without retrying when the master node of the Elasticsearch cluster was killed. The logs showed that the call to discover the ES version failed:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[https://elasticsearch:9200] returned [503|Service Unavailable:]
  at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:463)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
  at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:429)
  at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:155)
  at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:637)
  at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:276)

It seems that retry logic is only implemented for bulk calls. To improve the resiliency of the elasticsearch-hadoop library, could you please add retry logic to all REST calls from the library to the Elasticsearch cluster?
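
To illustrate the request, here is a minimal sketch of a generic client-side retry wrapper with exponential backoff. The class and method names (RetryingExecutor, executeWithRetry) are hypothetical and do not exist in elasticsearch-hadoop; this is only what such logic might look like, not the library's actual code:

import java.util.concurrent.Callable;

// Hypothetical retry wrapper, not part of the elasticsearch-hadoop API.
public final class RetryingExecutor {
    private static final int MAX_ATTEMPTS = 3;
    private static final long INITIAL_BACKOFF_MS = 500;

    // Runs the call, retrying on failure and doubling the wait between attempts.
    public static <T> T executeWithRetry(Callable<T> call) throws Exception {
        long backoffMs = INITIAL_BACKOFF_MS;
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(backoffMs);
                    backoffMs *= 2; // exponential backoff
                }
            }
        }
        throw last;
    }
}

The version-discovery call from the stack trace above could then, for example, be wrapped as executeWithRetry(() -> restClient.remoteEsVersion()), where restClient stands in for the library's internal RestClient instance.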

We are using elasticsearch-hadoop library version 5.4.

I think the issue here is that a 503 from other calls is not an expected outcome of regular operation. The bulk endpoint explicitly uses 503 (and, in more recent versions, 429) to inform the client that the bulk queues are full and that it should try again later. The other endpoints are not likely to return this status code, so we fail eagerly: the 503 could indicate a configuration problem that should be fixed, an intermediary between the client and the server that has seen fit to return a disruptive 503, or a server that is legitimately unavailable.
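
Expressed as a sketch (illustrative only, not the library's actual code; the ResponseClassifier and Outcome names are made up), the distinction amounts to something like this:

final class ResponseClassifier {
    enum Outcome { SUCCESS, RETRY, FAIL }

    static Outcome classify(int statusCode, boolean isBulkEndpoint) {
        if (statusCode < 400) {
            return Outcome.SUCCESS;
        }
        // Bulk back-pressure: 503 (or 429 in more recent versions) means the
        // bulk queues are full, so the rejected documents can be resent later.
        if (isBulkEndpoint && (statusCode == 503 || statusCode == 429)) {
            return Outcome.RETRY;
        }
        // On any other endpoint the same status could be a configuration
        // problem, an intermediary returning 503, or a server that is
        // legitimately unavailable, so the client fails eagerly.
        return Outcome.FAIL;
    }
}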
