Node experiencing relatively high CPU usage


(Nitish Sharma) #1

Hi,
We have a 5-node ES cluster. On one particular node, the ES process consumes
600-700% CPU (8 cores) all the time, while CPU usage on the other nodes is
always below 100%. We are running 0.19.8, and each node has an equal number
of shards.
Any suggestions?

Cheers
Nitish


(Igor Motov) #2

Run jstack on the node that is using 600-700% of CPU and let's see what
it's doing.
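
One way to read a long jstack dump is to start from the busiest OS threads. On Linux, `top -H -p <pid>` lists thread IDs in decimal, while jstack prints the same ID in hex as `nid=0x...`. The helper below is a minimal sketch of that lookup; the function name `thread_stack` is hypothetical, and it assumes the usual HotSpot dump layout (a quoted thread name on the header line, a blank line between threads).

```python
def thread_stack(jstack_dump, os_thread_id):
    """Return the dump lines for the thread whose native id matches os_thread_id.

    os_thread_id is the decimal thread id as shown by `top -H`;
    jstack reports the same id in hex as `nid=0x...`.
    """
    nid_token = "nid=" + hex(os_thread_id)
    lines = jstack_dump.splitlines()
    for i, line in enumerate(lines):
        # Thread header lines start with the quoted thread name.
        if line.startswith('"') and nid_token in line.split():
            block = [line]
            # Collect the stack frames up to the blank line that ends the thread.
            for follow in lines[i + 1:]:
                if not follow.strip():
                    break
                block.append(follow)
            return block
    return []
```

Feeding it the thread IDs of the top CPU consumers narrows a 2000-line dump down to the few stacks that matter.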


(Nitish Sharma) #3

Hi Igor,
I couldn't make any sense of the jstack dump (2000 lines long). Maybe
you can help? http://pastebin.com/u57QB7ra

Cheers
Nitish


(Igor Motov) #4

It looks like this node is quite busy updating documents. Is it possible
that your indexing load is concentrated on the shards that just happened to
be located on this particular node?


(Nitish Sharma) #5

We are indeed running a lot of "update" operations continuously, but they
are not routed to specific shards. The document to be updated can be
on any of the shards (on any of the nodes). And, as I mentioned,
all shards are uniformly distributed across nodes.


(Igor Motov) #6

Interesting. Did you try running curl
"localhost:9200/_nodes/stats?pretty=true" to verify that indexing
operations really are uniformly distributed across the nodes?
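
An equal shard count does not by itself mean an equal indexing load, so it can help to compare the per-node counters directly. The sketch below takes the already-parsed JSON from the stats call and returns each node's share of indexing operations. The function name `indexing_share` is hypothetical, and the field path `nodes -> <id> -> indices -> indexing -> index_total` is an assumption that may not match the 0.19.8 response exactly.

```python
def indexing_share(stats):
    """Map node name -> fraction of all indexing operations handled by that node.

    `stats` is the parsed JSON of the node stats response. The field path
    used here (indices.indexing.index_total) is an assumption.
    """
    totals = {}
    for node_id, node in stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        indexing = node.get("indices", {}).get("indexing", {})
        totals[name] = indexing.get("index_total", 0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: count / grand_total for name, count in totals.items()}
```

With five nodes and an even load, every share should be close to 0.2; one node far above that would point to a skewed write pattern rather than a JVM problem.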


(Nitish Sharma) #7

Hi Igor,
I checked the stats, and elasticsearch-head also confirmed that each node
has an equal number of shards. Interestingly, over the weekend the constant
high CPU usage moved to another node, and the node that was previously
overusing CPU is now more or less normal. So, as far as I can tell, at any
given time at least one node is doing a lot of pure CPU work while the
other nodes are fairly quiet. Weird!
We are not indexing documents with routing, nor are we updating them with
routing.
Any other pointers?

Cheers
Nitish


(Stéphane Raux) #8

Hi,

What kind of client are you using? Does it balance its queries
across the five nodes, or does it always query the same one? If it
always hits the same node, that may explain this kind of behavior.

Best,

Stéphane


(Nitish Sharma) #9

We are using the Tire Ruby client. The ES cluster is behind HAProxy, so all
search, get, and update requests are (almost) equally distributed across
all nodes.


(Shay Banon) #10

Can you run jstack on another node? Let's see if it's doing any work as well. Which ES version are you using? Also, which JVM version and OS version, and are you running in a virtual environment or not?


(Nitish Sharma) #11

Hi Kimchy,
Following are the jstacks:
ES Node1 - CPU Usage 100-200%: https://gist.github.com/3216175
ES Node2 (offending node) - CPU Usage 600-700%: https://gist.github.com/3216198
ES Node3 - CPU Usage 100-200%: https://gist.github.com/3216200

ES Version: 0.19.8
OS: Ubuntu 10.04.4 LTS
JVM Versions: 20.0-b12 and 19.0-b09
All nodes are physical machines with 24 GB RAM, 8-core CPU.

Another observation - Bigdesk shows that no GET requests are reaching
node1, which is odd since HAProxy balances all requests in round-robin
fashion.
The problem is not just the CPU usage of node2 but also its heap memory usage.
Because the heap fills up so quickly, GC runs so often that node2
heavily skews our search performance.
Following are the heap graphs:
Node1: http://imgur.com/x1bpy
Node2: http://imgur.com/sNhBQ
Node3: http://imgur.com/izrNA
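
The Bigdesk observation about GET requests can be cross-checked against the node stats themselves. The sketch below compares two parsed stats snapshots taken some time apart and reports how many GETs each node served in between. The function names are hypothetical, and it assumes that the counter lives at `indices.get.total` and is cumulative, neither of which is confirmed for 0.19.8.

```python
def get_deltas(before, after):
    """Map node name -> number of GET operations served between two snapshots.

    Both arguments are parsed node stats responses. The field path
    (indices.get.total) and its cumulative nature are assumptions.
    """
    deltas = {}
    previous_nodes = before.get("nodes", {})
    for node_id, node in after.get("nodes", {}).items():
        prev = previous_nodes.get(node_id, {})
        now_total = node.get("indices", {}).get("get", {}).get("total", 0)
        prev_total = prev.get("indices", {}).get("get", {}).get("total", 0)
        deltas[node.get("name", node_id)] = now_total - prev_total
    return deltas


def idle_nodes(deltas):
    """Names of nodes that served no GET at all during the interval."""
    return sorted(name for name, delta in deltas.items() if delta == 0)
```

If a node shows up in `idle_nodes` while HAProxy reports that it receives its share of HTTP requests, the imbalance sits inside the cluster and not at the load balancer.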


(Shay Banon) #12

Can you try upgrading to a newer JVM? The ones you are using are pretty old. If you want to stay on 1.6, make sure it's a recent update (like update 33), and make sure you run the same JVM on all nodes.

Also, if you are up for it, a newer Ubuntu version (there is a new LTS as well) is recommended.


(Nitish Sharma) #13

Hi Kimchy,
I just updated all nodes to Java 7, though we are still on Ubuntu 10.04.
Java version: 1.7.0_05
JVM: 23.1-b03

The problems still persist:

  • One of the nodes is still using a lot of CPU and garbage-collecting
    heap memory almost every minute.
  • Bigdesk shows that one node is not receiving any GET requests (we have
    continuous update operations going on).

Any more suggestions? :confused:


(Nitish Sharma) #14

Hi Kimchy, Igor,
Given the high load on one particular node, we added 5 more nodes to our
cluster, expecting the load to be distributed. We also stopped all update
operations. After running stably for about a week, today 2 out of 10 nodes
suddenly started acting up. They have exceptionally high IO wait time and
thus high load, which in turn increases query execution time. Note that we
are not doing any update operations, only simple indexing.

Jstack of a node with normal load: http://pastebin.com/vYmE8dZe
Jstack of the node with high IO load: http://pastebin.com/5xUeBqZU
Jstack of another node with high IO load: http://pastebin.com/Rafi3Fbk

All nodes run Java 7, Ubuntu 12.04 and JVM 23.1-b03.
Any pointers for tracking down the problem would be great.

Cheers
Nitish


(Nitish Sharma) #15

More information - On these 2 particular nodes, we continuously get these
warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
    at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
    at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings all concern shard 3, and both copies of that shard reside on
these 2 nodes. Could the abnormal I/O load on these 2 nodes be caused by
corruption of shard 3?

Cheers
Nitish
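
One way to confirm that the warnings really do concern a single shard is to tally them from the node's log file. The following is a minimal sketch that assumes the log header layout shown in the excerpt above; the function name and the regular expression are illustrative and not part of Elasticsearch.

```python
import re
from collections import Counter

# Header of a merge-failure warning as it appears in the excerpt above:
#   [..][WARN ][index.merge.scheduler ] [node5] [rolling_index][3] failed to merge
# The header may be wrapped across lines, so whitespace is matched loosely.
MERGE_FAILURE = re.compile(
    r"\[WARN\s*\]\[index\.merge\.scheduler\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s*failed to merge"
)


def count_merge_failures(log_text):
    """Count failed merges per (node, index, shard) found in log_text."""
    counts = Counter()
    for match in MERGE_FAILURE.finditer(log_text):
        key = (match.group("node"), match.group("index"), int(match.group("shard")))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample taken from the warning header quoted in this post.
    sample = (
        "[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler ] [node5]\n"
        "[rolling_index][3] failed to merge\n"
    )
    for (node, index, shard), n in count_merge_failures(sample).items():
        print(node, index, shard, n)
```

If every key in the result names the same index and shard, that supports the suspicion that one damaged shard is behind the repeated merge attempts.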

(Igor Motov) #16

Yes, I can see how constantly trying to merge segments and failing at it
can cause abnormal I/O load. Has this cluster ever run out of disk space or
memory while it was indexing?


(Nitish Sharma) #17

Yeah, at some point a couple of nodes ran out of memory. We recovered the
nodes by completely stopping the offending application.
Is there any way to recover from this segment merge failure? This particular
"_25zy9" segment seems to be the only one that fails. Is there any way to start
this shard fresh, even if it means losing the data in this particular segment?

Cheers
Nitish


(Nitish Sharma) #18

Anyone got an idea about how to recover the shard?


(Igor Motov) #19

I would try to shut down ES, back up all files in the shard's index directory,
and run the Lucene CheckIndex tool there. I have never had to run it on
Elasticsearch indices, but since they are Lucene indices, it might just
work.
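
The steps suggested here can be sketched roughly as follows. This is a hedged illustration rather than a tested recovery procedure: the helper names and the jar location are hypothetical. The shard path is the one from the warning quoted earlier in the thread. CheckIndex lives in the lucene-core jar as `org.apache.lucene.index.CheckIndex`, and in Lucene 3.x its `-fix` option drops segments it cannot read, so any documents in those segments are lost. The node should be stopped first, and the lucene-core jar should be the one bundled with the installed Elasticsearch so that the versions match.

```python
import shutil
from pathlib import Path

CHECK_INDEX_CLASS = "org.apache.lucene.index.CheckIndex"


def backup_shard_index(index_dir, backup_dir):
    """Copy the shard's Lucene index directory to backup_dir (must not exist yet)."""
    return Path(shutil.copytree(str(index_dir), str(backup_dir)))


def check_index_command(lucene_core_jar, index_dir, fix=False):
    """Build the CheckIndex command line; pass fix=True only after a backup."""
    cmd = ["java", "-cp", str(lucene_core_jar), CHECK_INDEX_CLASS, str(index_dir)]
    if fix:
        # Lucene 3.x flag: removes unreadable segments, losing their documents.
        cmd.append("-fix")
    return cmd


if __name__ == "__main__":
    # Shard path taken from the warning in this thread.
    shard_index = Path(
        "/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index"
    )
    # Hypothetical jar location; adjust to the actual Elasticsearch lib directory.
    jar = Path("/usr/share/elasticsearch/lib/lucene-core.jar")
    # Print the read-only check first; it reports problems without changing files.
    print(" ".join(check_index_command(jar, shard_index)))
```

Running the command without `-fix` first only reports which segments are broken. Only if the report looks acceptable would one rerun it with `fix=True`, after `backup_shard_index` has made a copy, and then restart the node.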

On Tuesday, August 21, 2012 11:37:00 AM UTC-4, Nitish Sharma wrote:

Anyone got an idea about how to recover the shard?

On Friday, August 17, 2012 1:06:00 PM UTC+2, Nitish Sharma wrote:

Yeah, at some point couple of nodes ran out of memory. We recovered the
nodes by completely stopping the offending application.
Is there anyway to recover from this segment merge failure? This
particular "_25fy9" segment seems to be the only failed segment. Any way to
start this shard fresh even if it means losing data in this particular
segment?

Cheers
Nitish

On Friday, August 17, 2012 4:15:41 AM UTC+2, Igor Motov wrote:

Yes, I can see how constantly trying to merge segments and failing at it
can cause abnormal I/O load. Has this cluster ever run out of disk space or
memory while it was indexing?

On Thursday, August 16, 2012 12:51:40 PM UTC-4, Nitish Sharma wrote:

More information - On these 2 particular nodes, we continuously get
these warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
    at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
    at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings are related to shard number 3 and both copies reside on
these 2 nodes. Could it be possible that abnormal (IO load) behaviour of
these 2 nodes is because of corruption of shard 3?
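
To confirm that the warnings really involve only one shard and one segment file, the node log could be summarized with something like the sketch below. The regular expressions are written against the single log entry quoted above, so treat them as an assumption about the log layout rather than a general parser.

```python
import re
from collections import Counter

# Matches the "[WARN ][index.merge.scheduler ] [node] [index][shard] failed to merge"
# header of the warning quoted above.
MERGE_FAILURE = re.compile(
    r"\[WARN\s*\]\[index\.merge\.scheduler\s*\]\s*\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s*failed to merge"
)
# Matches the file name at the end of the path="..." part of the exception message.
SEGMENT_FILE = re.compile(r'path="[^"]*/(?P<file>_[0-9a-z]+\.\w+)"')


def summarize_merge_failures(log_text):
    """Return (failure count per (node, index, shard), sorted segment file names)."""
    shards = Counter(
        (m["node"], m["index"], int(m["shard"]))
        for m in MERGE_FAILURE.finditer(log_text)
    )
    files = sorted({m["file"] for m in SEGMENT_FILE.finditer(log_text)})
    return shards, files
```

If the counter contains a single (node, index, shard) key per node and the file list contains a single segment, that supports the idea that one damaged segment is behind the repeated merge attempts.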

Cheers
Nitish

On Thursday, August 16, 2012 5:00:14 PM UTC+2, Nitish Sharma wrote:

Hi Kimchy, Igor,
Considering all this high load on 1 particular node, we added 5 more
nodes to our cluster, thinking that the load would be distributed. We also
stopped all update operations. After running stably for about 1 week, today
2 out of 10 nodes suddenly started acting up. They have
exceptionally high IO wait time and thus high load, which in turn is
increasing query execution time. Note that we are not doing any update
operations; only simple indexing.

Jstack of a node with normal load: http://pastebin.com/vYmE8dZe
Jstack of the node with high IO load: http://pastebin.com/5xUeBqZU
Jstack of another node with high IO load: http://pastebin.com/Rafi3Fbk
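
Since these dumps run to thousands of lines, a small script can make them easier to compare between a normal node and a loaded one. The sketch below assumes the usual jstack text layout (a quoted thread name line, a `java.lang.Thread.State:` line, then `at ...` frames, with blank lines between threads); it counts thread states and the top frame of every RUNNABLE thread.

```python
import re
from collections import Counter

THREAD_STATE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def thread_state_counts(dump):
    """Count threads per state (RUNNABLE, WAITING, TIMED_WAITING, ...)."""
    return Counter(THREAD_STATE.findall(dump))


def runnable_top_frames(dump):
    """Count the topmost stack frame of each RUNNABLE thread."""
    frames = Counter()
    # Thread entries in a jstack dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump):
        lines = [line.strip() for line in block.splitlines()]
        if "java.lang.Thread.State: RUNNABLE" not in lines:
            continue
        top = next((line[3:] for line in lines if line.startswith("at ")), None)
        if top:
            frames[top] += 1
    return frames
```

Calling `runnable_top_frames(dump).most_common(10)` on each dump gives a quick view of where the busy threads are spending their time.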

All nodes run Java 7, Ubuntu 12.04 and JVM 23.1-b03.
It would be great if some pointers can be provided to track down the
problem.

Cheers
Nitish

On Thursday, August 2, 2012 7:16:32 PM UTC+2, Nitish Sharma wrote:

Hi Kimchy,
I just updated all nodes to use Java 7; though still using Ubuntu
10.04.
Java version: 1.7.0_05
JVM: 23.1-b03

The problems still persist:

  • One of the nodes is still using a lot of CPU and is garbage
    collecting heap memory almost every minute.
  • Bigdesk shows that 1 node is not receiving any GET requests (we
    have continuous update operations going on).

Any more suggestions? :confused:

On Thursday, August 2, 2012 2:14:42 PM UTC+2, kimchy wrote:

Can you try upgrading to a newer JVM? The ones you use are pretty
old. If you want to use 1.6, then make sure it's a recent update (like
update 33), and make sure you have the same JVM across all nodes.

Also, if you are up for it, a newer Ubuntu version (there is a new
LTS as well) is recommended.

On Jul 31, 2012, at 2:31 PM, Nitish Sharma sharmani...@gmail.com
wrote:

Hi Kimchy,
Following are the jstacks:
ES Node1 - CPU Usage 100-200%: https://gist.github.com/3216175
ES Node2 (offending node) - CPU Usage 600-700%:
https://gist.github.com/3216198
ES Node3 - CPU Usage 100-200%: https://gist.github.com/3216200

ES Version: 0.19.8
OS: Ubuntu 10.04.4 LTS
JVM Versions: 20.0-b12 and 19.0-b09
All nodes are physical machines with 24 GB RAM, 8-core CPU.

Another observation - Bigdesk shows that there are no GET requests
coming to node1, which is kind of weird since HAProxy balances all requests
in round-robin fashion.
The problem is not just the CPU usage of node2 but also its heap memory
usage. Because the heap fills up so quickly, GCs are so frequent that
node2 heavily skews our search performance.
Following are the heap graphs:
Node1: http://imgur.com/x1bpy
Node2: http://imgur.com/sNhBQ
Node3: http://imgur.com/izrNA

On Tuesday, July 31, 2012 7:44:37 AM UTC+2, kimchy wrote:

Can you jstack another node, so we can see if it's doing any work as
well? Which ES version are you using? Also, which JVM version and OS version,
and are you running in a virtual environment or not?

On Jul 31, 2012, at 1:47 AM, Nitish Sharma sharmani...@gmail.com
wrote:

We are using the Tire Ruby client. The ES cluster is behind HAProxy.
Thus, all search, get, and update requests are (almost) equally distributed
across all nodes.

On Monday, July 30, 2012 4:39:29 PM UTC+2, Stéphane R. wrote:

Hi,

What kind of clients are you using? Do they balance their queries
between the five nodes, or do they always query the same node? If they
always query the same one, it may explain this kind of behavior.

Best,

Stéphane

2012/7/30 Nitish Sharma sharmani...@gmail.com:

Hi Igor,
I checked the stats, and elasticsearch-head also confirmed that each node
has an equal number of shards. Moreover, interestingly, this weekend this
behaviour (of constant high CPU usage) was taken over by another node, and
the node previously over-using CPU is now more or less normal. So, as far
as I have observed, at any given point in time (at least) 1 node is doing
a lot of pure-CPU work, while the other nodes are fairly quiet. Weird!
We are not indexing documents with routing, nor updating them using
routing.
Any other pointers?

Cheers
Nitish

On Saturday, July 28, 2012 12:47:34 AM UTC+2, Igor Motov wrote:

Interesting. Did you try to run curl
"localhost:9200/_nodes/stats?pretty=true" to make sure that indexing
operations really are uniformly distributed?
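
One way to read the output of that stats call is to compute each node's share of the total indexing operations. The sketch below works on the already-parsed JSON; the field path `nodes -> <id> -> indices -> indexing -> index_total` is an assumption about the response layout and may need adjusting for 0.19.8.

```python
def indexing_share(node_stats):
    """Return each node's fraction of all indexing operations.

    node_stats is the parsed JSON body of the nodes stats call. The field
    names used here are assumed, not verified against 0.19.8.
    """
    totals = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        totals[name] = node.get("indices", {}).get("indexing", {}).get("index_total", 0)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: count / grand_total for name, count in totals.items()}
```

With 5 nodes and evenly spread indexing, every share should be close to 0.2; a node far above that is receiving a disproportionate part of the indexing load even if the shard count per node is equal.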

On Friday, July 27, 2012 6:15:29 PM UTC-4, Nitish Sharma wrote:

We are, indeed, running a lot of "update" operations continuously, but
they are not routed to specific shards. The document to be updated can be
present on any of the shards (on any of the nodes). And, as I mentioned, all
shards are uniformly distributed across nodes.



(Sebastian Lehn) #20

Has anybody tried the repair that Igor advised?
