Hi. I'm running an ES cluster with 50 data nodes (20 for a logging index, 20 for a data index, 10 for other indices), 4 coordinating nodes, and 3 master nodes.
One day the indexing performance of the logging index suddenly dropped.
- ES 7.4.0 running in Docker
- CPU usage was about 10%
- GC time was less than 5 ms
- SSD disk I/O was about 20 MB/s
- the hot threads API showed no hot threads on most of the logging data nodes
- the task management API showed a lot of 'indices:data/write/bulk[s]' or '[s][p]' actions on both coordinating and data nodes, with running_time_in_nanos higher than 5,000,000,000 (more than 5 seconds)
-- all coordinating nodes: a lot of the task statuses were rerouted
-- data nodes of the logging index: a lot of the task statuses were waiting_on_primary or primary, and the shard numbers varied
-- other data nodes: few indexing actions, with low running_time_in_nanos
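
For anyone who wants to reproduce the task check above, it could be scripted roughly as follows. This is only a sketch under some assumptions: the function names (`fetch_tasks`, `slow_bulk_tasks`, `summarize`) and the `http://localhost:9200` address are made up for illustration, and it assumes the `GET _tasks?detailed=true` response has a nodes -> tasks layout with the phase (rerouted, waiting_on_primary, primary) under `status.phase`, which is my understanding of the 7.x output.

```python
import json
import urllib.request
from collections import Counter

SLOW_NANOS = 5_000_000_000  # 5 seconds, the threshold mentioned above


def fetch_tasks(base_url="http://localhost:9200"):
    """Fetch detailed bulk task info. base_url is an assumed address."""
    url = base_url + "/_tasks?detailed=true&actions=indices:data/write/bulk*"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def slow_bulk_tasks(tasks_response, threshold_nanos=SLOW_NANOS):
    """Yield (node_name, action, phase, running_nanos) for slow bulk tasks."""
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            action = task.get("action", "")
            if not action.startswith("indices:data/write/bulk"):
                continue  # not a bulk indexing task
            running = task.get("running_time_in_nanos", 0)
            if running <= threshold_nanos:
                continue  # fast enough, not interesting here
            # Assumption: replication tasks report their phase in status.phase.
            phase = (task.get("status") or {}).get("phase", "unknown")
            yield node.get("name", "unknown"), action, phase, running


def summarize(tasks_response, threshold_nanos=SLOW_NANOS):
    """Count slow bulk tasks per (node name, phase)."""
    return Counter(
        (name, phase)
        for name, _action, phase, _running in slow_bulk_tasks(
            tasks_response, threshold_nanos
        )
    )
```

Calling `summarize(fetch_tasks())` against one node of the cluster would then give a count of slow bulk tasks per node and phase, which makes it easier to see whether the stuck tasks cluster on particular nodes.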
After I restarted the containers, it went back to normal.
- CPU usage went up to nearly 80%
- GC time was about 20 ms
- SSD disk I/O was about 450 MB/s
- no tasks with a long running_time_in_nanos
It looked like indexing was stuck somewhere between the indexing requests being accepted and the documents being written to the shards. I don't know how to diagnose this. Is there a tool I can use?