Elasticsearch nodes go high CPU

We are having a really strange problem with our Elasticsearch cluster. Every couple of days or so, a node in our 4-node cluster spikes to high CPU, along with a very large number of reads.

We have checked the tasks page, and there doesn't appear to be much going on, maybe 10-15 tasks total for the node.

The problem moves between nodes, so sometimes it is node 2 and sometimes node 4 that goes to high CPU.

We are using the following architecture:

  • 12-core / 24-thread CPU
  • 128 GB RAM
  • 7 TB drives in RAID-0

Elasticsearch 2.3.1

3 instances on each node:

1 - Master
2 - Data1
3 - Data2

We have about 10 billion records spread across 3000 indices, each with 3 shards and 1 replica. There are several dashboards, created around 1.5 years ago, that run constantly. This problem started around a month ago and seems to be getting worse.
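For context, that works out to roughly 18,000 shards in total (3000 indices × 3 shards, plus one replica of each) spread over the eight data instances. A quick way to see how they are distributed per node is the _cat/allocation API; a minimal sketch in Python, assuming the cluster is reachable on localhost:9200:

```python
# Sketch: summarize shard count and disk usage per node via _cat/allocation.
# Assumes the cluster is reachable on localhost:9200 (adjust HOST as needed).
import requests

HOST = "http://localhost:9200"

# _cat/allocation lists shard count, disk used, and disk available per node
resp = requests.get(f"{HOST}/_cat/allocation", params={"v": "true", "bytes": "gb"})
resp.raise_for_status()
print(resp.text)
```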

How would we go about diagnosing what is happening during these high-CPU periods?

Use the hot threads API to get some insight into what is going on while you are experiencing these periods of high CPU usage.
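For example, something like the following can be run against the cluster while a node is spiking; the host and sampling parameters below are assumptions, so adjust them to match your setup:

```python
# Sketch: capture hot threads from every node while the CPU spike is happening.
# Assumes the cluster is reachable on localhost:9200.
import requests

HOST = "http://localhost:9200"

resp = requests.get(
    f"{HOST}/_nodes/hot_threads",
    params={
        "threads": 5,         # top 5 hottest threads per node
        "interval": "500ms",  # sampling interval
        "type": "cpu",        # sample CPU time (also accepts "wait" or "block")
    },
)
resp.raise_for_status()
print(resp.text)  # plain-text stack traces, grouped per node
```

Running it a few times during a spike and comparing the output should show whether the time is going to search, merges, or something else entirely.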

We have been using the tasks and hot_threads APIs, and during these periods we do not see much in hot threads other than the management tasks.

It seems we are running into an issue where the Data1 and Data2 instances are both competing for I/O and getting blocked in I/O wait.
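One way to confirm the I/O wait theory from the OS side is to sample the iowait counters in /proc/stat while a spike is happening. This is just a sketch of that idea, not tied to Elasticsearch itself, and it is Linux-only:

```python
# Sketch: sample the machine-wide iowait percentage from /proc/stat.
# Run it on the node while the CPU spike is happening.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # first line: aggregate "cpu" counters
    return [int(x) for x in fields]

def iowait_percent(interval=1.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    # field index 4 of the "cpu" line is time spent in iowait
    return 100.0 * deltas[4] / total if total else 0.0

if __name__ == "__main__":
    while True:
        print(f"iowait: {iowait_percent():5.1f}%")
```

A sustained high iowait percentage during the CPU spikes would support the idea that the two data instances are fighting over the same array.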

We are working on getting each instance onto its own RAID array; currently both instances go through the same RAID array.
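In the meantime, the fs section of the node stats API shows which data path(s) each instance is actually writing to, which should make it easy to verify the split once the arrays are separated. A rough sketch, again assuming localhost:9200 (the exact fields returned can vary by version):

```python
# Sketch: list the data path(s) each node instance is using, from _nodes/stats/fs.
# Assumes the cluster is reachable on localhost:9200.
import requests

HOST = "http://localhost:9200"

stats = requests.get(f"{HOST}/_nodes/stats/fs").json()
for node_id, node in stats["nodes"].items():
    paths = [d.get("path") for d in node["fs"]["data"]]
    print(f'{node["name"]}: {paths}')
```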
