Constant High (~99%) CPU on 1 of 5 Nodes in Cluster


(michael-4) #1

Hey guys,

We've been running a 5 node cluster for our index (5 shards, 1 replica,
evenly distributed on 5 nodes), and are running into a problem with one of
the nodes in the cluster. It is not unique to any specific node, and can
happen sporadically on any of the nodes.

One of the machines spikes to nearly 100% CPU load and close to 8 OS load (which is amusing, considering the machine only has 4 CPU cores), while all the other machines operate normally, way below those figures. Naturally, this behavior is accompanied by extremely high write and read times as well.

Here's what Marvel looks like:

https://lh5.googleusercontent.com/-bxUFPhqAnVk/U9gKg4c19nI/AAAAAAAAABE/S_w68vZ63Uo/s1600/Marvel+-+Node+Statistics.png

Here's all the information we could gather:

Thoughts? Insights? Any clues would be greatly appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/05b552dc-70fe-4b76-abfb-eb9db2a9dd34%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Kireet Reddy) #2

We've had a very similar issue, but haven't been able to figure out what
the problem is. How do you "fix" the problem? Will a node restart fix the
problem immediately or do you need to restart the whole machine?



(Greg Murnane) #3

From the Marvel image, it looks like the heap utilization isn't dropping
periodically as it does on the other nodes. Can you verify that GC is
behaving nicely while this occurs?
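One way to check: `GET /_nodes/stats/jvm` reports per-node heap usage and GC collector totals. A minimal sketch of reading that response for nodes stuck in old-gen collection (the payload below is hand-made for illustration, not real cluster output):

```python
# Spot a node burning time in old-gen GC from the /_nodes/stats/jvm response.

def gc_summary(stats):
    """Return {node_name: (heap_used_percent, old_gc_seconds)} per node."""
    out = {}
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        old = jvm["gc"]["collectors"]["old"]
        out[node["name"]] = (
            jvm["mem"]["heap_used_percent"],
            old["collection_time_in_millis"] / 1000.0,
        )
    return out

# Invented sample payload shaped like the node stats API response.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {
            "mem": {"heap_used_percent": 62},
            "gc": {"collectors": {
                "young": {"collection_count": 420, "collection_time_in_millis": 9000},
                "old": {"collection_count": 3, "collection_time_in_millis": 1200},
            }},
        }},
        "a2": {"name": "node-2", "jvm": {
            "mem": {"heap_used_percent": 97},
            "gc": {"collectors": {
                "young": {"collection_count": 400, "collection_time_in_millis": 8000},
                "old": {"collection_count": 180, "collection_time_in_millis": 240000},
            }},
        }},
    }
}

for name, (heap, gc_s) in gc_summary(sample).items():
    flag = "  <-- suspect" if heap > 90 and gc_s > 60 else ""
    print(f"{name}: heap {heap}% used, {gc_s:.0f}s in old-gen GC{flag}")
```

A healthy node's heap should sawtooth down after each old-gen collection; a node pinned above ~90% heap while old-gen GC time keeps climbing is the one to look at.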




(Luis García Acosta) #4

Sorry for hijacking the post with more questions, but why does the memory climb all the way up and then drop on the other nodes? Is that normal?



(smonasco-2) #5

What jumps out at me is that the CPU work you're doing seems to be very index related, your garbage collections are trying hard on the errant machine and not getting anywhere, and you have a lot of deleted docs.

Tell us about your indexing strategy: things like routing, how bursty it is, and maybe why you have so many deleted docs.



(smonasco-2) #6

Also, is it always the master node that goes awry?


(michael-4) #7

Thanks everyone for replying. As it turns out, all our problems stemmed
from our index schema.

Since our app was heavily modeled after social networks, we had to store our users' followers by ID. Each user document had an array called "follower_ids": the IDs of the people following that user. That's all fine, until a user like the NBA or Pepsi comes in with millions and millions of followers, and that array becomes immense. We also tried turning the array into a nested object of [{id: 1}, {id: 2}, ...], but because of the way nested documents are indexed, that ended up even worse.

We pinpointed the problem to the updates where we append IDs to the follower_ids array. Ultimately, we flipped the schema around: instead of storing a giant "follower_ids" array, we started storing "following_ids", meaning each user's document lists the users they follow.

Our current schema works great! CPU never goes above 25%, OS load stays consistent, and our cluster is functioning super fast.
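For anyone hitting the same wall, here's a rough sketch of the before/after. The field names are from our mapping; the example documents themselves are invented:

```python
# Before: every user document carries all of its followers' IDs. Appending
# a follower means reindexing the whole document, so for an account like
# Pepsi each new follow rewrites an enormous doc (and the old version
# becomes a deleted doc awaiting merge, hence the huge deleted-doc count).
pepsi = {"name": "Pepsi", "follower_ids": [1, 2, 3]}  # imagine millions here

def add_follower(user_doc, follower_id):
    user_doc["follower_ids"].append(follower_id)  # whole giant doc reindexed
    return user_doc

# After: each user stores only who *they* follow. A new follow touches the
# follower's own small document; the popular account's doc never changes.
alice = {"name": "alice", "following_ids": []}

def follow(user_doc, target_name):
    user_doc["following_ids"].append(target_name)  # small doc, cheap update
    return user_doc

follow(alice, "Pepsi")
print(alice)  # {'name': 'alice', 'following_ids': ['Pepsi']}
```

The update cost now scales with how many accounts one person follows (small and bounded) instead of how many followers a celebrity has (unbounded).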

