High CPU usage question


(snmaynard) #1

We've been using elasticsearch in production for a month or so, and I'm
seeing strange CPU usage patterns: really high CPU for periods,
followed by really low CPU for a long time. These patterns cause timeouts,
so I'm trying to get to the bottom of them.

Firstly, I'm running on an AWS small instance, so it could be just that the
instance is too small - but I don't think so, as CPU usage can be close to
zero for long periods.

I've run hot threads a couple of times, and the output is available here: https://gist.github.com/snmaynard/6f90c014e6748f528be3
and here: https://gist.github.com/snmaynard/81e69a3a891522ce8ff1
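For reference, dumps like those come from the nodes hot threads API; a typical invocation (host, port, and parameter values here are illustrative) looks something like:

```shell
# Dump the hottest threads on every node in the cluster.
curl -s "http://localhost:9200/_nodes/hot_threads"

# Narrow it down: top 5 threads per node, sampled over a 1s interval.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=5&interval=1s"
```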

I've also included copies of munin graphs, so you can see what I mean by
strange usage patterns. There's no real increase in traffic that would
explain this. Is this "hitting a cliff" of performance something you would
expect to see on elasticsearch, or is something else happening here that
would explain it? I am running 0.20.2 in production.

https://lh6.googleusercontent.com/-Rggs1-Up5zM/US7ZN_3MUwI/AAAAAAAAAGk/JIIaCrK9rg8/s1600/cpu-week.png
https://lh6.googleusercontent.com/-EYxeaykYCTI/US7ZLMHziRI/AAAAAAAAAGc/odSZzHdT6s8/s1600/cpu-month.png

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Shay Banon) #2

Seems like it's busy doing a refresh at the time in question (making recent changes available for search). It might also be that large merges are happening at the time (you can use the node stats API with the indices flag to check whether there is an ongoing large merge). If that's the case, you can use merge throttling in order to keep that at bay.
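Both checks can be done over HTTP. The endpoint and setting names below are as documented for the 0.20.x line, so treat this as a sketch to adapt rather than a definitive recipe:

```shell
# Node stats with index-level detail: look at merges.current and
# merges.total_time_in_millis to see if a large merge is in flight.
curl -s "http://localhost:9200/_nodes/stats?indices=true&pretty=true"

# Throttle merge I/O cluster-wide so merges can't saturate the machine
# (setting names from the 0.20.x store-level throttling docs).
curl -s -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "persistent": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'
```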

On Feb 28, 2013, at 5:18 AM, Simon snmaynard@gmail.com wrote:



(snmaynard) #3

Previously I was updating documents using an insert with a duplicate id, as
my driver didn't support upsert, and it seemed to work OK.

On a gut feeling I thought that might be causing this issue, so I
changed from inserting an object with a duplicate id to using the upsert
API.
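For anyone following along, the switch was to the update API's upsert support. A sketch of such a call (the index, type, and field names here are made up for illustration) might look like:

```shell
# Update-or-insert in one round trip: if doc id 1 exists, the script runs
# against it; otherwise the "upsert" document is indexed as-is.
curl -s -XPOST "http://localhost:9200/myindex/mytype/1/_update" -d '{
  "script": "ctx._source.counter += 1",
  "upsert": { "counter": 1 }
}'
```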

Within seconds of deploying the code, the CPU usage dropped and load
dropped from 3.5+ to around 0.1. I indexed a new document and everything is
still working.

Does this sound right to you guys? Is there a bug here or was I just doing
something you should never do?

On Wednesday, February 27, 2013 11:59:28 PM UTC-8, kimchy wrote:



(Shay Banon) #4

That sounds strange… effectively, an update internally does the same thing as an index with the same id. Is ES the only process running on the machine?

On Mar 2, 2013, at 4:06 AM, Simon snmaynard@gmail.com wrote:



(snmaynard) #5

Yeah, it's the only thing on the machine. It could just be a coincidence that the change coincided with a drop in CPU usage - as I said, it did jump around a fair bit - but load has been around 0.1 overnight, so it seems like it has had some kind of effect.

I might drop it back to the old method to see if there is a strong CPU usage correlation and report back to you.

On Saturday, March 2, 2013 at 6:14 AM, kimchy@gmail.com wrote:



(acv2) #6

I'm using elasticsearch 1.5,

and it works perfectly most of the time, but every day at the same time it goes crazy: CPU % jumps to ~70% when the average is around 3-5%. These are big servers with 32GB reserved for Lucene, swap is locked, and clearing the cache doesn't solve the problem (it doesn't bring the heap mem down).

Settings:

3 servers (nodes), 32 cores and 128GB RAM each
2 buckets (indices): one with ~18 million documents (this one doesn't receive updates very often, just indexes new docs); the other has around 7-8 million documents, but we are constantly bombarding it with updates, searches, deletes, and indexing as well.

The best distribution for our structure was to have only 1 shard per node with no replicas. We can afford to have a percentage of the data offline for a few seconds; it will be back as soon as the server comes online again, and that process is fast enough since it doesn't need to relocate anything. Previously we had 3 shards with 1 replica, but the issue mentioned above occurred as well, so it's easy to figure out that the problem is not related to the distribution.

Things that I have already tried:

Merging: I tried to use the Optimize API to give less load to the scheduled merges, but the merging process actually takes a lot of disk R/W and doesn't substantially affect the mem or the CPU load.

Flushing: I tried flushing with long and short intervals, and the results were the same; nothing changed. Flushing directly affects the merging process and, as mentioned above, the merging process doesn't take that much of the CPU or mem usage.

Managing the cache: clearing it manually doesn't seem to bring the CPU load back to a normal state, not even for a moment.
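For concreteness, the three things tried above map to these 1.x endpoints (index name is a placeholder):

```shell
# Force a merge down to fewer segments (heavy on disk I/O).
curl -s -XPOST "http://localhost:9200/myindex/_optimize?max_num_segments=5"

# Flush the transaction log to disk.
curl -s -XPOST "http://localhost:9200/myindex/_flush"

# Clear the caches. Note that in 1.x this frees field data and filter
# caches; it does not shrink JVM heap usage directly.
curl -s -XPOST "http://localhost:9200/myindex/_cache/clear"
```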

Here is the most of the elasticsearch.yml configs

<nabble_a href="elasticsearch-yml.txt">elasticsearch-yml.txt</nabble_a>

here is the stats when the server is in a normal state:
<nabble_a href="node_stats_normal.txt">node_stats_normal.txt</nabble_a>

Node stats during the problem.
<nabble_a href="node_stats.txt">node_stats.txt</nabble_a>

When the server is in a normal state

<nabble_img src="pic1-1.png" border="0"/>
<nabble_img src="pic1-2.png" border="0"/>

When the server is under really heavy CPU load

<nabble_img src="pic1-2.png" border="0"/>
<nabble_img src="pic2-2.png" border="0"/>

I would appreciate any help or discussion that can point me in the right direction to get rid of this behavior.

Thanks in advance.

Regards,

Daniel


(system) #7