Elasticsearch read timed out

I have a 4-node Elasticsearch cluster running on m4.2xlarge machines. Some processes are indexing while others are querying the index, and I'm getting a read timed out error from Elasticsearch.

I'm using the elasticsearch-py and elasticsearch-dsl libraries. I've tried increasing the timeout to 60 seconds and the max retries to 10, but I'm still getting the timeout error (in some of the processes). What is the underlying cause of this timeout error, and how can I resolve it?

This is the error I got:

elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'es_cluster', port=9200): Read timed out. (read timeout=60))

Anything will be helpful. Appreciate it!

Hard to say, a timeout is a pretty generic error. Essentially, the node didn't respond and so it timed out.

I'd verify that the retries are set correctly (and perhaps enable logging to make sure it's actually retrying?).
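As a rough sketch of what that check could look like with elasticsearch-py: the 60 second timeout and 10 retries are the values from the question, while `CLIENT_KWARGS`, `enable_client_logging` and `make_client` are names made up for this example. One thing worth knowing is that `retry_on_timeout` defaults to `False` in elasticsearch-py, so `max_retries` on its own does not retry a request that timed out. Parameter and logger names differ between client releases, so treat these as assumptions and check them against the version you have installed.

```python
import logging

# Timeout and retry count are taken from the question. retry_on_timeout is
# off by default in elasticsearch-py, so timed-out requests are not retried
# unless it is switched on explicitly.
CLIENT_KWARGS = {
    "timeout": 60,
    "max_retries": 10,
    "retry_on_timeout": True,
}


def enable_client_logging(level=logging.DEBUG):
    """Send the client's log records to stderr so failed attempts are visible."""
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    logger = logging.getLogger("elasticsearch")  # logger name assumed for older client releases
    logger.setLevel(level)
    return logger


def make_client(hosts):
    """Hypothetical helper: build a client with the settings above."""
    # Imported here so the rest of the module loads without the package.
    from elasticsearch import Elasticsearch
    return Elasticsearch(hosts, **CLIENT_KWARGS)
```

Calling `enable_client_logging()` before `make_client(["es_cluster:9200"])` should log each failed attempt, which makes it easy to see whether the client really retries before raising `ConnectionTimeout`.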

Barring that, you'll need to determine why the servers are timing out. Things to check:

  • Are there long GCs being logged in the Elasticsearch server logs? A long GC effectively blocks all communication on the node until it finishes, so if you're experiencing a lot of GCs, this can manifest as timeouts on the client's side.
  • DNS issues?
  • Are the nodes actually reachable from the client? Network partitions?
  • Are the search and index/bulk thread pools rejecting requests?
  • Are all the nodes using the same port (e.g. 9200), and is it accessible externally?
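A couple of these checks can be scripted from the client machine. The following is only a sketch using the Python 3 standard library: the helper names are made up for this example, and it assumes an Elasticsearch version whose `_cat/thread_pool` API supports `format=json` and exposes the `node_name`, `name`, `active`, `queue` and `rejected` columns (older releases use different column names).

```python
import json
import socket
import urllib.request


def is_reachable(host, port=9200, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    DNS failures, refused connections and connect timeouts all raise OSError
    subclasses, so each of them shows up here as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch_thread_pools(base_url, timeout=10.0):
    """Fetch per-node thread pool stats from the _cat/thread_pool API as JSON."""
    url = (base_url.rstrip("/")
           + "/_cat/thread_pool?format=json&h=node_name,name,active,queue,rejected")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def rejecting_pools(rows):
    """Keep only the rows whose `rejected` counter is above zero."""
    return [row for row in rows if int(row.get("rejected") or 0) > 0]
```

Run `is_reachable(node)` for each node's hostname from the machine where the timeouts happen, then pass the result of `fetch_thread_pools("http://es_cluster:9200")` to `rejecting_pools`. A non-empty result for the search or bulk pools suggests the cluster is overloaded rather than unreachable.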