Constant High (~99%) CPU on 1 of 5 Nodes in Cluster


(michael-4) #1

Hey guys,

We've been running a 5 node cluster for our index (5 shards, 1 replica,
evenly distributed on 5 nodes), and are running into a problem with one of
the nodes in the cluster. It is not unique to any specific node, and can
happen sporadically on any of the nodes.

One of the machines spikes to nearly 100% CPU load and close to 8 OS load (which is amusing, considering the machine only has 4 CPU cores), while all the other machines operate normally, way below those figures. Naturally, this behavior is accompanied by extremely high write and read times as well.

Here's what Marvel looks like:

https://lh5.googleusercontent.com/-bxUFPhqAnVk/U9gKg4c19nI/AAAAAAAAABE/S_w68vZ63Uo/s1600/Marvel+-+Node+Statistics.png

Here's all the information we could gather:

Thoughts? Insights? Any clues would be greatly appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/05b552dc-70fe-4b76-abfb-eb9db2a9dd34%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Kireet Reddy) #2

We've had a very similar issue, but haven't been able to figure out what
the problem is. How do you "fix" the problem? Will a node restart fix the
problem immediately or do you need to restart the whole machine?



(Greg Murnane) #3

From the Marvel image, it looks like the heap utilization isn't dropping
periodically as it does on the other nodes. Can you verify that GC is
behaving nicely while this occurs?
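One way to check: `GET /_nodes/stats/jvm` reports per-node heap usage and GC collector totals. A minimal sketch of reading that response for nodes stuck in old-gen collection (the payload below is hand-made for illustration, not real cluster output):

```python
# Spot a node burning time in old-gen GC from the /_nodes/stats/jvm response.

def gc_summary(stats):
    """Return {node_name: (heap_used_percent, old_gc_seconds)} per node."""
    out = {}
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        old = jvm["gc"]["collectors"]["old"]
        out[node["name"]] = (
            jvm["mem"]["heap_used_percent"],
            old["collection_time_in_millis"] / 1000.0,
        )
    return out

# Invented sample payload shaped like the node stats API response.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {
            "mem": {"heap_used_percent": 62},
            "gc": {"collectors": {
                "young": {"collection_count": 420, "collection_time_in_millis": 9000},
                "old": {"collection_count": 3, "collection_time_in_millis": 1200},
            }},
        }},
        "a2": {"name": "node-2", "jvm": {
            "mem": {"heap_used_percent": 97},
            "gc": {"collectors": {
                "young": {"collection_count": 400, "collection_time_in_millis": 8000},
                "old": {"collection_count": 180, "collection_time_in_millis": 240000},
            }},
        }},
    }
}

for name, (heap, gc_s) in gc_summary(sample).items():
    flag = "  <-- suspect" if heap > 90 and gc_s > 60 else ""
    print(f"{name}: heap {heap}% used, {gc_s:.0f}s in old-gen GC{flag}")
```

A healthy node's heap should sawtooth down after each old-gen collection; a node pinned above ~90% heap while old-gen GC time keeps climbing is the one to look at.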




(Luis García Acosta) #4

Sorry for hijacking the post with more questions, but why does the memory climb all the way up and then drop on the other nodes? Is that normal?



(smonasco-2) #5

What jumps out at me is that the CPU work you're doing seems to be very index related, your garbage collections are trying hard on the errant machine and not getting anywhere, and you have a lot of deleted docs.

Tell us about your indexing strategy: things like routing, how bursty it is, and maybe why you have so many deleted docs.



(smonasco-2) #6

Also, is it always the master node that goes awry?


(michael-4) #7

Thanks everyone for replying. As it turns out, all our problems stemmed
from our index schema.

Since our app was heavily modeled after social networks, we had to store our users' followers by ID. Each user document had an array called "follower_ids": the IDs of the people following that user. That's all fine, until a user like the NBA or Pepsi comes in with millions and millions of followers, and that array becomes immense. We also tried turning the array into a nested object of [{id: 1}, {id: 2}, ...], but because of the way nested documents are indexed, that ended up even worse.

We pinpointed the problem to the updates where we append IDs to the follower_ids array. Ultimately, we flipped the schema around: instead of storing a giant "follower_ids" array, we started storing "following_ids", meaning each user's document lists the users they follow.

Our current schema works great! CPU never goes above 25%, OS load stays consistent, and our cluster is functioning super fast.
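For anyone hitting the same wall, here's a rough sketch of the before/after. The field names are from our mapping; the example documents themselves are invented:

```python
# Before: every user document carries all of its followers' IDs. Appending
# a follower means reindexing the whole document, so for an account like
# Pepsi each new follow rewrites an enormous doc (and the old version
# becomes a deleted doc awaiting merge, hence the huge deleted-doc count).
pepsi = {"name": "Pepsi", "follower_ids": [1, 2, 3]}  # imagine millions here

def add_follower(user_doc, follower_id):
    user_doc["follower_ids"].append(follower_id)  # whole giant doc reindexed
    return user_doc

# After: each user stores only who *they* follow. A new follow touches the
# follower's own small document; the popular account's doc never changes.
alice = {"name": "alice", "following_ids": []}

def follow(user_doc, target_name):
    user_doc["following_ids"].append(target_name)  # small doc, cheap update
    return user_doc

follow(alice, "Pepsi")
print(alice)  # {'name': 'alice', 'following_ids': ['Pepsi']}
```

The update cost now scales with how many accounts one person follows (small and bounded) instead of how many followers a celebrity has (unbounded).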

