Hey,
we're having issues with search queries timing out on our 61-node ES cluster:
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='XXXXXX', port=9200): Read timed out. (read timeout=300))
Debugging showed that some nodes have a high number of active (and rejected) search tasks.
$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_cat/thread_pool/*?v&h=node_name,name,active,rejected,completed" | sort -nk3 | awk '{if ($3 > 5) print $0}'
node_name name active rejected completed
a6es-e8-es1 search 23 87776 1965656928
a6es-e4-es0 search 28 11525 2027376804
a6es-e3-es0 search 32 364 122469486
I did some digging and changed refresh_interval on the indices from 1s to 60s - this helped a lot (since we have quite heavy writes) and the issue basically disappeared for a month.
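For reference, the change was roughly this (the index pattern here is just a placeholder, not our real index names):

$escurl -XPUT "https://a6es-e.ng.seznam.cz:9200/my-index-*/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "refresh_interval": "60s"
  }
}'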
Sadly, last week the issue started creeping up on us again, and I don't think a further increase of refresh_interval will help. As far as I can tell, the write load on the cluster hasn't changed in the last two weeks. I have added a few automated cron searches against the cluster, but they should be negligible compared to the rest of the manual and automated queries.
The cluster admins gave me access to the logs on one of the currently affected nodes, but without knowing what I should look for, it's a needle in a haystack.
Is there a way to figure out what those threads are doing (the queries, or at least the indices they are "working on")?
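For example, is the tasks API (or hot threads) the right direction here, or is there something better? Roughly what I had in mind:

# list running search tasks; with detailed=true the description should include the indices and query source
$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_tasks?actions=*search*&detailed=true&pretty"
# hot threads on one of the affected nodes (node name taken from the thread_pool output above)
$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_nodes/a6es-e8-es1/hot_threads"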
Anything that would point me in the right direction to solve this problem would be appreciated.
Thanks.