%iowait growing with index size


(slimsuperhero) #1

Hello,

we need the community's help.

We have a two-node ElasticSearch cluster.
We are indexing ~10k docs per second (nginx access log entries, actually).
A new index is created every day at 4 AM (local time). Each index has 6
shards, and the index size for the whole day is ~210 GB.
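
For scale, a quick back-of-envelope from those numbers (a sketch; the result is the average on-disk footprint per document including all Lucene structures, not the raw log line size):

```python
# Back-of-envelope: average on-disk bytes per indexed document,
# from the figures above (~10k docs/s, ~210 GB per daily index).
docs_per_sec = 10_000
seconds_per_day = 24 * 60 * 60
index_bytes = 210e9  # ~210 GB per daily index

docs_per_day = docs_per_sec * seconds_per_day  # 864 million docs/day
bytes_per_doc = index_bytes / docs_per_day     # ~243 bytes/doc on disk
print(f"{docs_per_day:,} docs/day, ~{bytes_per_doc:.0f} bytes/doc on disk")
```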

http://elasticsearch-users.115913.n3.nabble.com/file/n4044712/04.png

A week ago, when we had only one node (called es1), %iowait grew with the
size of the index. By 9 PM it was reaching 60% and the node became almost
unresponsive.
After deleting the current index, %iowait fell to 2% while indexing stayed
as intensive as before. That's why I'm sure it is not a hardware
performance issue.
For now we are only using ElasticSearch to fetch some analytics from the logs
"on the fly", so deleting the index is not very harmful yet.

Anyway, first of all we decided to add a second node (called es2) to
load-balance the I/O.
The nodes have almost identical hardware: each has a RAID controller
configured with RAID-5, and write performance is very good.

We were very surprised when, the next morning, we found that %iowait on
es1 stays low (~3%) while es2 repeats the previous scenario: %iowait
skyrockets after several hours.
So when %iowait becomes very large we are forced to delete the current index
(it is recreated automatically). After that %iowait falls to almost
zero. The indexing rate is the same all the time (~10k docs per second).
http://elasticsearch-users.115913.n3.nabble.com/file/n4044712/111.png

When %iowait on es2 skyrockets we can see massive reading from disk and huge
tps (tps ~2500, and ElasticSearch is constantly reading something from disk
at ~30 MB/s). At the same time %iowait on es1 is low and there are not many
reads in comparison to es2. The document count is the same on es1 and
es2 (3 shards on each node).
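
In case it helps to reproduce the measurement: iostat takes its numbers from /proc/diskstats, and a small script can sample the same per-device counters. A minimal Python sketch (the device name "sda" in the comment is an assumption; use whatever device backs your RAID volume):

```python
import time

def parse_diskstats(text, device):
    """Find `device` in /proc/diskstats content and return
    (reads_completed, bytes_read). Sectors are 512 bytes on Linux."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) > 5 and fields[2] == device:
            return int(fields[3]), int(fields[5]) * 512
    raise ValueError(f"device {device!r} not found")

def sample_read_rate(device, interval=5.0):
    """Sample read IOPS and MB/s over `interval` seconds --
    roughly the r/s and read-throughput columns iostat shows."""
    with open("/proc/diskstats") as f:
        r0, b0 = parse_diskstats(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        r1, b1 = parse_diskstats(f.read(), device)
    return (r1 - r0) / interval, (b1 - b0) / interval / 1e6

# Example (device name is an assumption, adjust to your setup):
# iops, mbps = sample_read_rate("sda", interval=5)
```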

I think merging is involved, but ElasticHQ reports only a ~5 MB/s merge
rate on each node, while the read rate on es2 is about 30 MB/s when
experiencing the issue (we can see it with iostat).

I have no idea how to fix the issue and, more importantly, how to understand
what is going on.

Any help would be greatly appreciated!
Thanks!

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/iowait-growing-with-index-size-tp4044712.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Otis Gospodnetić) #2

Hi,

Underneath each ES index is a Lucene index... actually, one Lucene index per
ES shard. Lucene performs index segment merges. What Lucene on es1
does, and when it does it, is independent of what it does on es2.
That's why you don't see the same thing happening at the same time on both
es1 and es2. If you peek at the Lucene index directories a bit, you'll probably
see segment merging during the spikes you are seeing.
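
One cheap way to do that peeking is to list the segment files in a shard's index directory and watch which ones grow during a spike. A minimal Python sketch (the data path in the comment is an assumption; adjust it for your install, index name, and shard number):

```python
from pathlib import Path

def segment_summary(shard_dir):
    """List Lucene segment files (named _0.cfs, _1.fdt, ...) in a
    shard's index directory, largest first. New files growing here
    during an I/O spike usually mean a merge is writing a bigger
    segment out of several smaller ones."""
    files = sorted(Path(shard_dir).glob("_*"),
                   key=lambda p: p.stat().st_size, reverse=True)
    return [(p.name, p.stat().st_size) for p in files]

# Path is an assumption -- adjust data dir, index name and shard number:
# for name, size in segment_summary(
#         "/var/lib/elasticsearch/data/nodes/0/indices/myindex/0/index"):
#     print(f"{size/1e6:10.1f} MB  {name}")
```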

Don't know about ElasticHQ, but SPM will show you the relevant stats by node,
index, and shard, so you'll be able to narrow things down to
node > index > shard. If you can capture any sort of logging and feed those
logs into Logsene (hey, it's got an ES API, so you'll know how to feed
it!), then you'll be able to easily correlate the ES metrics you see in
SPM with any application/ES logs that tell you what ES is
doing, hopefully helping you troubleshoot this faster.

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Thursday, November 21, 2013 8:30:19 AM UTC-5, Gennady Aleksandrov wrote:




(system) #3