I recently upgraded my ES from 1.5.2 to 2.2.0 and added Shield to it. I'm running a stress test with Locust that blasts the cluster with data (through a Node.js app).
I got strange results compared to the previous stress test (on 1.5.2):
        1.5.2                 2.2.0
CPU     50% avg, 90% peak     87% avg, 96% peak
IOPS    30 avg, 300 peak      800 avg, 1122 peak
Why is ES working so much harder?
Another strange thing that I can't understand, and that I think is connected to the above, is the output of the head plugin.
Previously (1.5.2) I saw indices store data as:
My guess is:
Your index has replicas, so the head plugin reports double the document count and disk usage. As for CPU: doc_values are enabled by default in 2.x, so unless you disable them in your mapping template you pay for them in CPU, disk space and IOPS. The IOPS increase may be a combination of doc_values and the translog being fsynced on every request (also the default in 2.x).
Agreeing with the points @rusty raised: doc values being on by default add some CPU/IO overhead and some extra disk space, the translog is now fsynced after every request (instead of every 5s), and there is the replica issue.
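If you want to check how much of the difference comes from those three settings, you could rerun the stress test against an index created with them switched back. Here is a minimal sketch in Python that only builds the index-creation body; the helper name, the type name `event` and the field name `user_id` are placeholders I made up, and the setting names are the ones I know from the 2.x docs, so verify them against your version.

```python
import json


def build_test_index_body(type_name="event", field_name="user_id"):
    """Build a hypothetical index body for a comparison run (placeholder names)."""
    return {
        "settings": {
            "index": {
                # No replicas, so head shows primary-only doc counts and size.
                "number_of_replicas": 0,
                # fsync the translog in the background instead of on every
                # request, which is closer to the 1.x behaviour.
                "translog": {"durability": "async", "sync_interval": "5s"},
            }
        },
        "mappings": {
            type_name: {
                "properties": {
                    field_name: {
                        "type": "string",
                        "index": "not_analyzed",
                        # Doc values are on by default in 2.x for this kind of
                        # field; turn them off for the comparison.
                        "doc_values": False,
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body you would send when creating the test index.
    print(json.dumps(build_test_index_body(), indent=2))
```

Keep in mind that `async` durability can lose the last few seconds of writes if a node crashes, so I would only use it to isolate the cause, not as a production setting.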
In addition to that, there was a change at the Lucene layer. Incoming blob of text, but the tl;dr is that Lucene identifies idle resources and utilizes them, making the resource usage look higher when it's really just getting work done faster.
So, in Elasticsearch 1.x, we forcefully throttled Lucene's segment merging process to prevent it from over-saturating your nodes/cluster.
The problem is that a strict threshold is almost never the right answer. If you are indexing heavily, you often want to increase the threshold to let Lucene use all your CPU and Disk IO. If you aren't indexing much, you likely want the threshold lower. But you also want it to be able to "burst" the limit for one-off merges when your cluster is relatively idle.
In practice, what this means is that your indexing tends to be faster in ES 2.0+ because segments are allowed to merge as fast as your cluster can handle, without over-saturating your cluster. But it also means that your cluster will happily use any idle resources, which is why you see more resource utilization.
Basically, Lucene identified that those resources weren't being used...so it put them to work to finish the task faster.
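If the extra merge activity does end up hurting you (for example search latency suffers while indexing), the merge scheduler has index-level settings you can experiment with. A rough sketch in Python that builds such a settings body, assuming the setting names `index.merge.scheduler.max_thread_count` and `index.merge.scheduler.auto_throttle` as I remember them from the 2.x docs; the helper name is made up, so double-check everything against your version before applying it.

```python
import json


def merge_scheduler_settings(max_thread_count=1, auto_throttle=True):
    """Build a hypothetical settings body for the merge scheduler."""
    return {
        "index": {
            "merge": {
                "scheduler": {
                    # Fewer merge threads means less concurrent merge IO.
                    "max_thread_count": max_thread_count,
                    # Keep the adaptive throttling described above enabled.
                    "auto_throttle": auto_throttle,
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(merge_scheduler_settings(), indent=2))
```

In most cases I would leave these at their defaults and only touch them if you can measure that merges are actually the problem.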