Good Morning,
All nodes (11 data nodes) went to 100% CPU, apparently due to some query (or queries), and did not return to normal even after 10 minutes.
Monitoring:
https://snapshot.raintank.io/dashboard/snapshot/Vvcn2sxInYPjMoFWtAJv2P0jMg5Bm04R
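For reference, the per-node hot threads API is one way to see what is actually burning the CPU at a moment like this (the threads parameter is optional and only limits how many threads are reported per node):

GET _nodes/hot_threads?threads=3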
In the logs we found that some of the queries were spatial.
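For reference, the per-index search slowlog can also surface slow queries together with their source, which makes this kind of query easier to spot (a sketch; the index name and thresholds are placeholders):

PUT /XXXXXXXXX/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}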
Using the command "GET _tasks?actions=*search&detailed" we identified more than 500 pending search tasks.
We tried to cancel these tasks with the following requests, but did not succeed:
POST _tasks/_cancel?nodes=MXdSDhr6TaSz3zyfd4GHNA&actions=*search
POST _tasks/_cancel?actions=*search
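For reference, an individual task can also be targeted by the id reported in the task list, in the node_id:task_number form (a sketch; the task number below is a placeholder):

POST _tasks/MXdSDhr6TaSz3zyfd4GHNA:12345/_cancel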
We then decided to restart all the nodes, following the rolling restart procedure.
This took around 1 hour.
After that everything returned to normal.
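For reference, the rolling restart was roughly the standard sequence, repeated for each data node (a sketch, shown here just for context):

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "none" } }

(stop the node, start it again, and wait for it to rejoin the cluster)

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "all" } }

GET _cluster/health?wait_for_status=green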
Elasticsearch cluster:
version 2.4.1
3 master nodes
11 data nodes
indices (columns: health status index pri rep docs.count docs.deleted store.size pri.store.size):
green open XXXXXXXXX 20 1 805805 61389 720mb 362.6mb
green open XXXXXXXXX 20 1 225114047 63873085 187.4gb 92.8gb
green open XXXXXXXXX 20 1 241362 0 476.1mb 238mb
green open XXXXXXXXX 20 1 81916 0 26.1mb 13mb
green open XXXXXXXXX 20 0 7746 287 1.1gb 1.1gb
green open XXXXXXXXX 20 1 2817049 494145 2.4gb 1.1gb
green open XXXXXXXXX 20 1 2479279875 1730805009 5.8tb 2.9tb
green open XXXXXXXXX 05 1 13734 11 17.3gb 8.6gb
green open XXXXXXXXX 20 1 63258 458 51.1mb 25.3mb
green open XXXXXXXXX 20 1 975508 169956 963.1mb 484.7mb
green open XXXXXXXXX 20 1 14898463 666475 8.9gb 4.4gb
green open XXXXXXXXX 20 1 1078516 339622 2.3gb 1.1gb
green open XXXXXXXXX 20 1 134646704 23321177 947.6gb 477.1gb
green open XXXXXXXXX 20 1 5898478 1474837 5.6gb 2.8gb
green open XXXXXXXXX 20 1 6036864082 1180357 2.2tb 1.1tb
green open XXXXXXXXX 20 1 1705705965 616230738 2.3tb 1.1tb
green open XXXXXXXXX 20 1 4601595 0 7.5gb 3.7gb
green open XXXXXXXXX 20 1 451058 0 217.8mb 108.9mb
GET /_cluster/settings
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    },
    "threadpool": {
      "bulk": {
        "queue_size": "500"
      }
    }
  }
}
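(For completeness, the transient bulk queue_size above would have been applied with a request along these lines; just a sketch, with the value copied from the output above.)

PUT _cluster/settings
{
  "transient": {
    "threadpool.bulk.queue_size": "500"
  }
}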
Could someone help me figure out what could have caused this?
How can we find out?
What should we do next time?
Thanks in advance.