High CPU Load schedualed at specific time


(acv2) #1

I'm using elasticsearch 1.5

and it is working perfectly the most part of the time, but everyday at the same time it becomes crazy, CPU % goes to ~70% when the average is around 3-5% there are SUPER servers with 32GB reserved for lucene, swap it is lock and clearing the cache doesn't solve the problem (it doesn't take down the heap mem)

Settings:

3 servers (nodes) 32 cores and 128GB RAM each
2 buckets (indices) one with ~18 million documents (this one doesn't receive updates pretty often just indexing new docs) the other one have around 7-8 million documents but we are constantly bombarding it with updates search delete and indexing as well

The best distribution for our structure, was to have only 1 shard per node with not replicas, we can afford to have a % of the data off for few seconds, that will be back as soon as the server get online again, and this process is fast enough since it doesn't need to relocate anything. previously we used to have 3 shards with 1 replica, but the issue mentioned above occurs as well, so is easy to figure it out that the problem is not related with the distribution.

Things that I already tried,

Merging, i try to use the Optimize API trying to give less load to the schedule merge, but actually the merging process takes a lot of R/W of the disk but it doesn't affect substantially the mem or the CPU load.

Flushing, I tried to flush with long and shot intervals, and the results were the same nothing changed, since flushing affects directly the merging process and as mentioned above, merging process doesn't takes that much of the CPU or mem usage.

managing the cache, clearing it manually but it doesn't seems to take the cpu load to normal state not even for a moment.

Here is the most of the elasticsearch.yml configs

<nabble_a href="elasticsearch-yml.txt">elasticsearch-yml.txt</nabble_a>

here is the stats when the server is in a normal state:
<nabble_a href="node_stats_normal.txt">node_stats_normal.txt</nabble_a>

Node stats during the problem.
<nabble_a href="node_stats.txt">node_stats.txt</nabble_a>

When the server is in a normal state

<nabble_img src="pic2-2.png" border="0"/>

When the server is taking really big on the CPU

<nabble_img src="pic2-1.png" border="0"/>

I will appreciate any help or discussion that can point me in the right direction to get rid of this behavior

thanks in advance..

Regards,

Daniel


(system) #2