Unexpected read i/o on es nodes

(Horst Birne) #1


over the past few days i encountered a strange behaviour with one of our ES nodes. Although only indexing is running on this nodes and no queries are executed, the server still have quite high (~45mb/s) read i/o´s. Because the nodes have built in SSDs the raw read i/o`s arent a problem, but we have zfs compression activated on the nodes, so it generate some CPU load.

The green line is the read i/o´s, the red line the write i/o´s.

Could this be related to the merges, that happen while indexing? But if so, why does this behaviour dont show up on other indexing nodes?

Thanks for any response.

(Zachary Tong) #2

My first guess was merges too. When segments merge, the data from all the segments in the merge are stream-merge-sorted together, which will slurp down read I/O in addition to write.

Now, why it's not showing on other nodes...I'm not sure. Random thoughts in no particular order:

  • Snapshot taking place in the background, or shards being relocated to a different node?
  • If you plot active merges, do you see even merge activity across all your nodes?
  • Does this node have "hot" indices while the other nodes have older, inactive indices?
  • Do you have any other services running on that node that might be eating resources?
  • Perhaps this node has merge throttling turned off, while the others are throttled and thus not using as many resources?
  • Are you using custom routing, preference or some other mechanism that might make a node "hot" ?

(Horst Birne) #3

Thanks for sharing your thoughts,

There are nether snapshotting nor shard moving going on. The index is spread equally across nodes with no replica assigned to it. We currently dont use custom routing and the node is a standalone elasticsearch server but regarding merging there is something interesting going on:

Turning on debug log on the node, it turned out that there is an quite excessive segment merging going on:

[2015-08-26 09:06:41,142][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][2] merge segment [_qci] done: took [1.4m], [12.1 MB], [7,252 docs]
[2015-08-26 09:06:41,558][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][10] merge segment [_qwd] done: took [2.8m], [26.2 MB], [19,111 docs]
[2015-08-26 09:06:42,042][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][10] merge segment [_qww] done: took [1.9m], [17.2 MB], [12,051 docs]
[2015-08-26 09:06:59,605][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][8] merge segment [_tp4] done: took [1.1m], [10.1 MB], [5,767 docs]
[2015-08-26 09:07:02,634][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][4] merge segment [_q94] done: took [10.7m], [95.7 MB], [76,401 docs]
[2015-08-26 09:07:06,717][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][6] merge segment [_qbv] done: took [1.5m], [14.1 MB], [8,574 docs]
[2015-08-26 09:07:07,975][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][4] merge segment [_qfb] done: took [1.6m], [14.8 MB], [9,040 docs]
[2015-08-26 09:07:09,500][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][3] merge segment [_q7z] done: took [1.5m], [12.2 MB], [7,240 docs]
[2015-08-26 09:07:27,616][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][4] merge segment [_qg9] done: took [27.3s], [2.5 MB], [1,037 docs]
[2015-08-26 09:07:32,393][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][6] merge segment [_qcu] done: took [29.3s], [3.8 MB], [1,605 docs]
[2015-08-26 09:07:54,782][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][8] merge segment [_tpl] done: took [1.3m], [6.6 MB], [3,725 docs]
[2015-08-26 09:08:11,567][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][8] merge segment [_tp3] done: took [2.3m], [17.1 MB], [10,369 docs]
[2015-08-26 09:08:14,488][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][2] merge segment [_qdj] done: took [1.1m], [7.0 MB], [3,816 docs]
[2015-08-26 09:08:14,495][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][3] merge segment [_q6t] done: took [4.4m], [32.1 MB], [23,817 docs]
[2015-08-26 09:08:21,565][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][10] merge segment [_qxn] done: took [2.2m], [14.6 MB], [8,733 docs]
[2015-08-26 09:08:22,182][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][2] merge segment [_qco] done: took [2.6m], [17.5 MB], [10,817 docs]
[2015-08-26 09:08:31,065][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.26][3] merge segment [_q8u] done: took [1.3m], [8.2 MB], [4,797 docs]

On the other node, where the index lies, there is hardly any merging going on which is a bit strange, considering that an equally amount of shards lie on each nodes with no routing specified.

Regarding settings:

curl localhost:9200/_cluster/settings?pretty
"persistent" : {
"cluster" : {
"routing" : {
"allocation" : {
"cluster_concurrent_rebalance" : "6",
"node_concurrent_recoveries" : "10",
"disk" : {
"watermark" : {
"low" : "93%",
"high" : "96%"
"node_initial_primaries_recoveries" : "10"
"indices" : {
"store" : {
"throttle" : {
"type" : "merge",
"max_bytes_per_sec" : "300mb"
"recovery" : {
"concurrent_streams" : "15",
"max_bytes_per_sec" : "1500mb"

curl localhost:9200/_nodes/node/settings?pretty
"merge" : {
"scheduler" : {
"max_thread_count" : "10"
"policy" : {
"segments_per_tier" : "120",
"max_merge_at_once_explicit" : "30",
"max_merge_at_once" : "10"

These are the settings regarding merging, the node each have 64 cpu cores, so 10 threads for merging is ok imo.

Any additional thoughts on this? Thanks

(Zachary Tong) #4

The number of merges is not terribly worrying, Lucene performs merges all the time. Small segments are merged continuously into larger tiers.

The length of time per merge, however, is crazy slow. Merging 12mb should not take 1.4 minutes!

First thing you should probably do is verify the health of disks on that node. You could be seeing the death rattle of a dying drive, or a RAID that is in parity mode with reduced performance. Related, I'd investigate the configuration of that node to make sure ZFS is behaving.

Since you are using SSDs, is your IO scheduler configured to use noop or deadline? I'm not sure if ZFS changes how the io scheduler works tbh, but most distros default to cfq, which is optimized for spinning media. noop can provide like a 500x improvement in throughput. Unsure what the equivalent is if you aren't on Linux, but make sure you're using the SSD-optimized scheduler for your platform

  • What's the segment count on that misbehaving node? Is it a lot higher than the rest of the nodes?
  • In your Node Stats, how many merges are active?
  • What version of ES are you on?
  • How is the CPU during this time? Is it over-taxed? How powerful are those 64 cores? Merging can be CPU intensive too, since it is basically a glorified merge-sort. If the CPU is bottlenecking, merges will slow even if you have excess disk IO
  • Are you running Curator or some process to optimize old indices? Related, any other services on this box that could be stealing resources?

If you are willing (e.g. nothing sensitive in the output), a copy of your Node Stats would be helpful =)

I don't think it is causing the current problem, but I'd remove those merge scheduler tweaks. Your segments_per_tier is huge, and that could have serious downstream effects on Lucene performance which are hard to predict. The tricky thing with these settings is that they behave differently when an index is new + small segments vs. old with many full tiers and large segments. Small tweaks might appear to boost performance early on, but then trigger "stampeding herd" problems later which crush your node

Also, the thread count is per-shard, not per node. So if you have 10 actively indexing shards, that's 100 active threads with your settings. That will trigger the in-flight throttling a lot earlier, and cause contention amongst the threads (even with your large box).

All in all, it's almost never a good idea to fiddle with the Lucene merge settings. I've been working with ES for 4+ years and I still don't feel comfortable adjusting them, since the downstream effects are so hard to predict =)

(Horst Birne) #5

Thank your for sharing your knowledge with me/us, i really appreciate that!.

First, my Version is 1.7.1. I am using curator to optimize older indices, that lied on "storage" nodes, the newest indices dont get optimized manually.

Looking at our zfs settings, it turned out that the noop scheduler was already aktivated:

cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq

After that i followed your suggestion and removed all merging settings from my node config and after the restart of the node, the merge times suddendly went back to normal again!.

[2015-08-28 08:01:53,408][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.28][4] merge segment [_ck3] done: took [1.2m], [239.4 MB], [201,839 docs]
[2015-08-28 08:01:54,902][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.28][5] merge segment [_cps] done: took [2.3m], [465.9 MB], [387,796 docs]
[2015-08-28 08:04:26,824][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.28][9] merge segment [_cqr] done: took [6m], [1,578.8 MB], [1,328,504 docs]
[2015-08-28 08:04:27,216][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.28][8] merge segment [_crt] done: took [58.5s], [351.9 MB], [285,304 docs]
[2015-08-28 08:06:54,684][DEBUG][index.merge.scheduler ] [node] [logstash-2015.08.28][4] merge segment [_btr] done: took [14.6m], [2,784.1 MB], [2,325,318 docs]

Also, elasticsearch started to merge the segments of the older indices, that still lied on the node.(i guess to to the segments_per_tier setting was removed)

The CPU load went down to 30-40, which is ok for a 64core box. Before the change and with the slow merging times, the CPU was at 90-120.

The first few hours, the read i/o increased quite a bit, but after the older indices were done with merging, it normalized.

Conclusion: Never play with merge settings if your aren´t very very experienced with lucene/elasticsearch!.

Thank you for your great help.


(Zachary Tong) #6

Ah, great! Glad that was a (relatively) easy fix. :smile:

I'm actually pretty surprised that the minimal merge tweaks had that big of an effect, I would have guessed it was something else. But I guess that just shows how unpredictable they can be! =)

(system) #7