High CPU & high IOPS on StressTest


(Radap) #1

I recently upgraded my ES cluster from 1.5.2 to 2.2.0 and added Shield to it. I'm performing a stress test using Locust that blasts the cluster with data (via a Node.js app).
I got strange results compared to the previous stress test (on 1.5.2):

        1.5.2 ver               2.2.0 ver

CPU     50% avg, 90% peak       87% avg, 96% peak

IOPS    30 avg, 300 peak        800 avg, 1122 peak

Why is ES working so much harder?

Another strange thing that I can't understand, and which I think is connected to the above, is the output of the head plugin.
Previously (1.5.2) I saw index store data as:

Index_name
size: 10.3Gi (20.6Gi)
docs: 17,073,010 (17,073,010)

But now (2.2.0) it is as:

Index_name
size: 13.7Gi (29.3Gi)
docs: 10,217,220 (20,434,440)

As you can see, the data has doubled in ES 2.2.0. Why is that happening?
Is there something wrong with my 2.2.0 ES configuration?


#2

Hi! Would you show your index mapping?

My guess is:
Your index has replicas, so head shows double the document count and disk usage (the numbers in parentheses include replicas). As for CPU: in 2.x doc_values are enabled by default, which adds CPU%, disk space, and IOPS unless you disable them in your mapping template. The IOPS increase may be a combination of doc_values and the translog being fsynced on every request (the default in 2.x).
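For example, if you have a string field you never sort or aggregate on, doc_values can be turned off for it in a 2.x mapping like this (index, type, and field names here are just illustrations, not from your setup):

```json
PUT /index_name/_mapping/my_type
{
  "properties": {
    "my_field": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": false
    }
  }
}
```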


(Zachary Tong) #3

Agreeing with the points @rusty raised: doc values on by default add some CPU/IO overhead and some more disk space, the translog now fsyncs on every operation (instead of every 5s), and replicas double the reported totals.
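If you want to trade a little durability for something closer to the 1.x behavior, the translog can be switched back to async fsyncing. This is a sketch based on the 2.x index settings; double-check the names against the docs for your exact version:

```json
PUT /index_name/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}
```

Note that with "async" you can lose up to sync_interval's worth of acknowledged operations if a node crashes, so only do this if your use case tolerates it.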

In addition to that, there was a change at the Lucene layer. Incoming blob of text, but the tl;dr is that Lucene identifies idle resources and utilizes them, making the resource usage look higher when it's really just getting work done faster.

So, in Elasticsearch 1.x, we forcefully throttled Lucene's segment merging process to prevent it from over-saturating your nodes/cluster.

The problem is that a strict threshold is almost never the right answer. If you are indexing heavily, you often want to increase the threshold to let Lucene use all your CPU and Disk IO. If you aren't indexing much, you likely want the threshold lower. But you also want it to be able to "burst" the limit for one-off merges when your cluster is relatively idle.

In Lucene 5.x (used in ES 2.0+), they added a new style of merge throttling that monitors how active the index is, and automatically adjusts the throttle threshold (see https://issues.apache.org/jira/browse/LUCENE-6119, https://github.com/elastic/elasticsearch/pull/9243 and https://github.com/elastic/elasticsearch/pull/9145).

In practice, what this means is that your indexing tends to be faster in ES 2.0+ because segments are allowed to merge as fast as your cluster can handle, without over-saturating your cluster. But it also means that your cluster will happily use any idle resources, which is why you see more resource utilization.
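For completeness: the adaptive throttling can be switched off per index, in which case merge IO is not rate-limited at all. The setting name below is from the 2.x merge module docs; verify it applies to your version before using it:

```json
PUT /index_name/_settings
{
  "index.merge.scheduler.auto_throttle": false
}
```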

Basically, Lucene identified that those resources weren't being used...so it put them to work to finish the task faster. :slightly_smiling:


(Radap) #4

Hi! :slight_smile:
I didn't mention it, but yes, I do have replicas.
So, just to get it straight: doc_values were disabled by default in Elasticsearch 1.5.x?


(Radap) #5

Very interesting, Thank You!


#6

Yes, they were.
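In 1.x you had to opt in per field, e.g. like this (field name is illustrative; doc values only applied to not_analyzed string fields and to numeric/date fields):

```json
{
  "properties": {
    "my_field": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}
```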

