How to diagnose when indexing performance suddenly drops

Hi. I'm running an Elasticsearch cluster with 50 data nodes (20 for a logging index, 20 for a data index, 10 for other indices), plus 4 coordinating nodes and 3 master nodes.

One day the indexing performance of the logging index suddenly dropped.

  • Elasticsearch 7.4.0 running in Docker
  • CPU usage was about 10%
  • GC pauses were less than 5 ms
  • SSD disk I/O was about 20 MB/s
  • hot threads showed nothing notable on most of the logging data nodes
    -- https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
  • the task management API showed many 'indices:data/write/bulk[s]' or '[s][p]' actions on both the coordinating and data nodes, with running_time_in_nanos higher than 5,000,000,000 (a sketch of how I queried both APIs is after this list)
    -- all coordinating nodes: many tasks had status rerouted
    -- logging data nodes: many tasks had status waiting_on_primary or primary, across different shard numbers
    -- other data nodes: only a few indexing actions, all with low running_time_in_nanos
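
This is roughly how I checked the hot threads and the long-running bulk tasks; a minimal sketch only, with the host address and the 5-second threshold as placeholders for my setup:

```python
# Sketch: check hot threads and long-running bulk tasks (host is a placeholder).
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node address

# 1) Hot threads: plain-text output per node.
hot = requests.get(f"{ES}/_nodes/hot_threads", params={"threads": 3})
print(hot.text)

# 2) Task management API: list bulk write tasks and flag long-running ones.
tasks = requests.get(
    f"{ES}/_tasks",
    params={"detailed": "true", "actions": "indices:data/write/bulk*"},
).json()

THRESHOLD_NANOS = 5_000_000_000  # 5 seconds, the value I saw being exceeded
for node_id, node in tasks.get("nodes", {}).items():
    for task_id, task in node.get("tasks", {}).items():
        if task["running_time_in_nanos"] > THRESHOLD_NANOS:
            print(node["name"], task["action"], task["running_time_in_nanos"],
                  task.get("status"), task.get("description", ""))
```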

After I restarted the containers, everything went back to normal.

  • CPU usage went back up to close to 80%
  • GC pauses were about 20 ms
  • SSD disk I/O was about 450 MB/s
  • no tasks with long running_time_in_nanos

It looked like indexing was stuck somewhere between the point where indexing requests were accepted and the point where they were written into the shards. I don't know how to track this down. Is there any tool I can use?
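
If I understand the write path correctly, a backlog between "request accepted" and "written into the shard" should show up as a growing queue (or rejections) on the write thread pool of the logging data nodes, so this is what I plan to watch next time; a minimal sketch, with the host as a placeholder:

```python
# Sketch: watch the write thread pool queue/rejections via the _cat API.
import requests

ES = "http://localhost:9200"  # placeholder

rows = requests.get(
    f"{ES}/_cat/thread_pool/write",
    params={"format": "json", "h": "node_name,name,active,queue,rejected,completed"},
).json()

for row in rows:
    # A persistently large queue or a growing rejected count on the logging
    # data nodes would point at where indexing is piling up.
    print(row["node_name"], "active:", row["active"],
          "queue:", row["queue"], "rejected:", row["rejected"])
```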

Welcome :slight_smile:

Ideally you should split those use cases out, so that they don't impact each other, or the users of each dataset.

Do you have Monitoring enabled?
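
If it isn't, collection can be switched on with a cluster settings update; a minimal sketch assuming the built-in self-monitoring in 7.x, with host and credentials left as placeholders:

```python
# Sketch: enable self-monitoring collection via the cluster settings API.
import requests

ES = "http://localhost:9200"  # placeholder

resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"xpack.monitoring.collection.enabled": True}},
)
print(resp.json())  # expect "acknowledged": true
```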

Thanks for the reply.

While indexing of the logging data got slow, indexing into the other indices had no problems. It would be good to split the cluster per purpose, but allocating the indices to separate node groups has been enough so far.
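
For reference, this kind of per-purpose allocation can be done with custom node attributes and index-level allocation filtering; a minimal sketch, where the attribute name `group`, the values, and the index name are placeholders rather than my exact settings:

```python
# Sketch: pin an index to a group of data nodes with allocation filtering.
# Each data node carries a custom attribute in elasticsearch.yml, e.g.:
#   node.attr.group: logging        # placeholder attribute name/value
import requests

ES = "http://localhost:9200"  # placeholder

resp = requests.put(
    f"{ES}/logging-index/_settings",  # placeholder index name
    json={"index.routing.allocation.require.group": "logging"},
)
print(resp.json())
```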

All metrics are collected using Fluentd. I looked at the CPU, memory, and I/O of all the Elasticsearch nodes. Since the task management API said indexing was slow inside the cluster, I don't think Logstash was a likely cause of the problem.
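
To put a number on "indexing got slow", I've also been thinking of sampling the index stats twice and computing a rate, roughly like this; a sketch only, with the host and index name as placeholders:

```python
# Sketch: rough indexing-rate check by sampling the index stats twice.
import time
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "logging-index"        # placeholder index name

def index_total():
    stats = requests.get(f"{ES}/{INDEX}/_stats/indexing").json()
    return stats["_all"]["primaries"]["indexing"]["index_total"]

before = index_total()
time.sleep(60)
after = index_total()
print(f"~{(after - before) / 60:.1f} docs/s indexed over the last minute")
```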

Thanks.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.