%iowait growing with index size


(slimsuperhero) #1

Hello,

we need the community's help.

We have a two-node ElasticSearch cluster.
We are indexing ~10k docs per second (nginx access log entries, actually).
A new index is created every day at 4 AM (local time). Each index has 6
shards, and the index size for the whole day is ~210 GB.
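
For scale, a quick back-of-envelope from those numbers (a sketch; the result is the average on-disk footprint per document including all Lucene structures, not the raw log line size):

```python
# Back-of-envelope: average on-disk bytes per indexed document,
# from the figures above (~10k docs/s, ~210 GB per daily index).
docs_per_sec = 10_000
seconds_per_day = 24 * 60 * 60
index_bytes = 210e9  # ~210 GB per daily index

docs_per_day = docs_per_sec * seconds_per_day  # 864 million docs/day
bytes_per_doc = index_bytes / docs_per_day     # ~243 bytes/doc on disk
print(f"{docs_per_day:,} docs/day, ~{bytes_per_doc:.0f} bytes/doc on disk")
```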

http://elasticsearch-users.115913.n3.nabble.com/file/n4044712/04.png

A week ago, when we had only one node (called es1), %iowait grew with the
size of the index. By 9 PM it was reaching 60% and the node became almost
unresponsive.
After deleting the current index, %iowait fell to 2% while indexing stayed
as intensive as before. That's why I'm sure it is not a hardware
performance issue.
For now we are only using ElasticSearch to fetch some analytics from the logs
"on the fly", so deleting the index is not very harmful yet.

Anyway, first of all we decided to add a second node (called es2) to
load-balance the I/O.
The nodes have almost identical hardware: each has a RAID controller
configured with RAID-5, and write performance is very good.

We were very surprised when, the next morning, we found that %iowait on
es1 stays low (~3%) while es2 repeats the previous scenario: %iowait
skyrockets after several hours.
So when %iowait becomes very large we are forced to delete the current index
(it is recreated automatically). After that %iowait falls to almost
zero. The indexing rate is the same all the time (~10k docs per second).
http://elasticsearch-users.115913.n3.nabble.com/file/n4044712/111.png

When %iowait on es2 skyrockets we can see massive reading from disk and huge
tps (tps ~2500, and ElasticSearch is constantly reading something from disk
at ~30 MB/s). At the same time %iowait on es1 is low and there are not many
reads in comparison to es2. The document count is the same on es1 and
es2 (3 shards on each node).
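
In case it helps to reproduce the measurement: iostat takes its numbers from /proc/diskstats, and a small script can sample the same per-device counters. A minimal Python sketch (the device name "sda" in the comment is an assumption; use whatever device backs your RAID volume):

```python
import time

def parse_diskstats(text, device):
    """Find `device` in /proc/diskstats content and return
    (reads_completed, bytes_read). Sectors are 512 bytes on Linux."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) > 5 and fields[2] == device:
            return int(fields[3]), int(fields[5]) * 512
    raise ValueError(f"device {device!r} not found")

def sample_read_rate(device, interval=5.0):
    """Sample read IOPS and MB/s over `interval` seconds --
    roughly the r/s and read-throughput columns iostat shows."""
    with open("/proc/diskstats") as f:
        r0, b0 = parse_diskstats(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        r1, b1 = parse_diskstats(f.read(), device)
    return (r1 - r0) / interval, (b1 - b0) / interval / 1e6

# Example (device name is an assumption, adjust to your setup):
# iops, mbps = sample_read_rate("sda", interval=5)
```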

I think merging is involved, but ElasticHQ reports only a ~5 MB/s merge
rate on each node, while the read rate on es2 is about 30 MB/s when
experiencing the issue (we can see it with iostat).

I have no idea how to fix the issue and, more importantly, how to understand
what is going on.

Any help would be greatly appreciated!
Thanks!

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/iowait-growing-with-index-size-tp4044712.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Otis Gospodnetić) #2

Hi,

Underneath each ES index is a Lucene index... actually, one Lucene index per
ES shard. Lucene performs index segment merges. What Lucene on es1
does, and when it does it, is independent of what it does on es2.
That's why you don't see the same thing happening at the same time on both
es1 and es2. If you peek at the Lucene index directories a bit, you'll probably
see segment merging during the spikes you are seeing.
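
One cheap way to do that peeking is to list the segment files in a shard's index directory and watch which ones grow during a spike. A minimal Python sketch (the data path in the comment is an assumption; adjust it for your install, index name, and shard number):

```python
from pathlib import Path

def segment_summary(shard_dir):
    """List Lucene segment files (named _0.cfs, _1.fdt, ...) in a
    shard's index directory, largest first. New files growing here
    during an I/O spike usually mean a merge is writing a bigger
    segment out of several smaller ones."""
    files = sorted(Path(shard_dir).glob("_*"),
                   key=lambda p: p.stat().st_size, reverse=True)
    return [(p.name, p.stat().st_size) for p in files]

# Path is an assumption -- adjust data dir, index name and shard number:
# for name, size in segment_summary(
#         "/var/lib/elasticsearch/data/nodes/0/indices/myindex/0/index"):
#     print(f"{size/1e6:10.1f} MB  {name}")
```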

Don't know about ElasticHQ, but SPM will show you the relevant stats by node,
index, and shard, so you'll be able to narrow things down to
node > index > shard. If you can capture any sort of logging and feed those
logs into Logsene (hey, it's got an ES API, so you'll know how to feed
it!), then you'll be able to easily correlate the ES metrics you see in
SPM with any application/ES logs that tell you what ES is
doing, hopefully helping you troubleshoot this faster.

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Thursday, November 21, 2013 8:30:19 AM UTC-5, Gennady Aleksandrov wrote:




(system) #3