Hanging active search threads

We're having issues with search queries timing out on our 61-node Elasticsearch cluster.

elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='XXXXXX', port=9200): Read timed out. (read timeout=300))

Debugging showed that some nodes have a high number of active (and rejected) search tasks.

$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_cat/thread_pool/*?v&h=node_name,name,active,rejected,completed" | sort -nk3 | awk '{if ($3 > 5) print $0}'
node_name       name                active rejected  completed
a6es-e8-es1     search                  23    87776 1965656928
a6es-e4-es0     search                  28    11525 2027376804
a6es-e3-es0     search                  32      364  122469486
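For what it's worth, the hot threads API at least gives stack traces of the busiest threads on a node, which can hint at whether the search threads are stuck in the query phase, the fetch phase, or somewhere else. A minimal sketch (node name is one of the affected nodes above; the thread count is arbitrary):

```shell
# Dump the 5 busiest threads on the affected node, skipping idle ones.
# The output is human-readable stack traces, not JSON.
$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_nodes/a6es-e8-es1/hot_threads?threads=5&ignore_idle_threads=true"
```

This shows where CPU time goes, but not which query or index a thread belongs to.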

I did some digging and changed refresh_interval on our indices from 1s to 60s. This helped a lot (we have quite heavy writes), and the issue basically disappeared for a month.
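For reference, the change itself was just a dynamic index settings update, roughly like this (applied to all indices here for brevity; we targeted the write-heavy ones):

```shell
# Raise refresh_interval from the 1s default to 60s.
# This is a dynamic setting, so no index close/reopen is needed.
$escurl -XPUT "https://a6es-e.ng.seznam.cz:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index": { "refresh_interval": "60s" }
}'
```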

Sadly, last week the issue started creeping up on us again, and I don't think increasing refresh_interval further will help. As far as I can tell, the write load on the cluster hasn't changed in the last two weeks. I did add a few automated cron searches against the cluster, but they should be negligible compared to the rest of the manual and automated queries.

Cluster admins gave me access to the logs on one of the currently affected nodes, but without knowing what to look for, it's a needle in a haystack.

Is there a way to figure out what those threads are doing (the queries, or at least the indices they are working on)?
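The closest thing I've found so far is the task management API. If I'm reading the docs right, with `detailed=true` each running search task's description should include the target indices and the query source, something like:

```shell
# List currently running search tasks cluster-wide; detailed=true makes
# each task carry a description with the indices and query body.
$escurl -XGET "https://a6es-e.ng.seznam.cz:9200/_tasks?actions=*search*&detailed=true"
```

I'm not sure whether this reliably catches the long-hanging threads, or only well-behaved tasks, so corrections welcome.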

Anything that would kick me in the right direction to solve this problem would be appreciated.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.