0.90.1 I/O 100% at one node

I have run into some problems in my ES cluster since upgrading from version
0.20.1 to version 0.90.1.

My cluster has 10 servers. At least once a day one node hits 100% I/O;
looking at the Paramedic plugin, I saw that one node gets 50-60 concurrent
queries while the others get only 1 or 2.

When I restart the cluster, that node goes back to normal, but after 30-40
minutes another node goes to 100% I/O.

When I do a full restart, the cluster stays free of the problem for longer.

I'm attaching a screenshot from the Paramedic plugin in case it helps:
http://i.imgur.com/utOR3jJ.png

Thanks

--
Gustavo Maia

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

hey,

are you using any kind of routing? did you change anything in the way you
query your data after you upgraded?

simon


Hi Simon,

Sorry for the delay.

I'm not using any kind of routing, and I haven't changed anything in my index;
I just deployed the new version of ES.

I set up a cron job to run optimize every day at 00:00, when my traffic is
lowest. With this change I noticed an improvement: the number of concurrent
queries is now normal on all nodes, but I/O still increases over time. Right
after I restart the cluster, I/O is at 30%; after a day or two it reaches
60% (max).
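For reference, a nightly optimize like the one described can be a single crontab line calling the 0.90 `_optimize` endpoint over HTTP. The host and index name below are placeholders rather than the actual ones from this cluster, and `max_num_segments=1` (a full merge, which is heavy) is one possible choice, not necessarily what was used:

```shell
# m h dom mon dow  command
# Run optimize at 00:00 every day. "my_index" and localhost are placeholders;
# a higher max_num_segments is gentler than a full merge down to 1 segment.
0 0 * * * curl -s -XPOST 'http://localhost:9200/my_index/_optimize?max_num_segments=1' > /dev/null
```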

Any questions or suggestions are welcome.

I'll post any news on the status of the cluster.

Gustavo Maia


hey,

I'd guess that you are seeing the impact of segment merges. I am wondering
if you see a drop in index size (physical size on disk) on the nodes once
they come back to "normal". I am curious what causes this, but I guess it's
merging. Do you have a lot of updates coming in, and can you elaborate on
your setup a little? From the screenshot I can tell that you have 20M docs,
but that seems to be on a single node? I can't really make sense of the
screenshot; can you provide some more output from your cluster?

simon
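One way to track the on-disk size Simon asks about is the indices stats API (`GET /_stats`), which reports store size per index. A minimal sketch of pulling those numbers out of a response; the key layout matches what I recall from 0.90 but should be treated as an assumption, and the index names and sizes in the sample are made up:

```python
import json

def store_size_by_index(stats):
    """Return primary-store size in bytes per index from an ES indices-stats
    response (GET /_stats). Key layout is assumed from 0.90; adjust the keys
    if your version's response differs."""
    if isinstance(stats, str):
        stats = json.loads(stats)
    return {
        name: data["primaries"]["store"]["size_in_bytes"]
        for name, data in stats.get("indices", {}).items()
    }

# Hypothetical, trimmed sample response -- names and sizes are made up.
sample = {
    "indices": {
        "docs_big":   {"primaries": {"store": {"size_in_bytes": 570039468032}}},
        "docs_small": {"primaries": {"store": {"size_in_bytes": 1073741824}}},
    }
}
sizes = store_size_by_index(sample)
```

Polling this before and after a node "comes back to normal" would show whether physical index size actually drops, which is what a big merge completing would look like.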


Hi,

Let me give a brief description of my cluster.
Today we have 10 nodes (10 servers), each with 30GB RAM, a 400GB SSD in
RAID 1, and 8 cores. The cluster has 10 indexes, divided among the nodes.
Each index has 10 shards; only the 2 smallest indexes have replicas. All
indexes together total 690GB, spread across the nodes at about 69GB per
node. A single index is 530.9GB and accounts for roughly 77% of the total
size.
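A quick sanity check on the figures above; an even spread of shards and data across nodes is an assumption here, since the allocator only approximates it:

```python
# Figures quoted in the message above.
nodes = 10
indexes = 10
shards_per_index = 10
total_gb = 690.0
big_index_gb = 530.9

primary_shards = indexes * shards_per_index      # 100 primary shards
shards_per_node = primary_shards / nodes         # ~10 shards per node
data_per_node_gb = total_gb / nodes              # 69.0 GB per node
big_share = big_index_gb / total_gb              # ~0.77 of all data
big_shard_gb = big_index_gb / shards_per_index   # ~53 GB per shard of the big index
```

The ~53GB shards of the large index are worth noting: a merge touching even one of them moves a lot of bytes, which fits the sustained I/O pattern described.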

The problem occurred again today.
As you recommended, after the high I/O I checked whether any index had large
segments generated recently. I can guarantee there was no merge involving
large segments. See ( http://i.imgur.com/we1rAix.png )

On New Relic, I can see that I/O is rising over time. It has been four days
since I restarted my cluster and I/O keeps growing. See (
http://i.imgur.com/u0EQdwv.png )

When node "esdoc6" had the problem, I ran iotop to check whether it really
was only ES consuming all the I/O (330MB/s). See (
http://i.imgur.com/c1tCnM7.png )
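For anyone repeating the check, the iotop run would look roughly like this, executed as root on the affected node; the flags are standard iotop batch-mode options, and grepping for `java` to single out the ES process is my assumption:

```shell
# -o: only show processes actually doing I/O; -b: batch (non-interactive)
# mode; -n 3: take three samples. Filter to the JVM to spot Elasticsearch.
iotop -o -b -n 3 | grep -i java
```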

I'll try upgrading to the newer version of ES that just came out.

Any questions or suggestions are welcome.

I'll post any news on the status of the cluster.


Hi,

My cluster's I/O was quite high (60%), so I decided to restart the entire
cluster. It has been 30 minutes since the restart, and I/O is below 15%.

See (http://i.imgur.com/wgPVWzF.png)

Thanks for everything.


--
Gustavo Maia
