Node experiencing relatively high CPU usage

Sebastian - in Lucene land (and Solr, too) people run CheckIndex normally -
http://search-lucene.com/?q=%2B"CheckIndex"&sort=newestOnTop&fc_type=mail+_hash_+user

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Wednesday, September 19, 2012 2:31:58 AM UTC-4, Sebastian Lehn wrote:

Has anybody tried to repair like Igor advised?

On Tuesday, August 21, 2012 at 19:04:19 UTC+2, Igor Motov wrote:

I would try to shut down ES, back up all files in the shard index directory,
and run the Lucene CheckIndex tool there. I never had to run it on
elasticsearch indices, but since they are Lucene indices, it might just
work.
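
For reference, a CheckIndex run against a shard directory looks roughly like the
following (a sketch only: the lucene-core jar version and its location are
illustrative, since ES 0.19.x ships its own Lucene 3.6 jar under lib/; -fix
permanently drops any unreadable segments together with their documents):

    # stop elasticsearch first, then back up the shard's index directory
    cp -a /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index /backup/rolling_index_3_index

    # read-only check using the Lucene jar bundled with elasticsearch (path/version illustrative)
    java -cp /usr/share/elasticsearch/lib/lucene-core-3.6.0.jar \
      org.apache.lucene.index.CheckIndex \
      /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index

    # only re-run with -fix appended once you accept losing the broken segment(s)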

On Tuesday, August 21, 2012 11:37:00 AM UTC-4, Nitish Sharma wrote:

Anyone got an idea about how to recover the shard?

On Friday, August 17, 2012 1:06:00 PM UTC+2, Nitish Sharma wrote:

Yeah, at some point a couple of nodes ran out of memory. We recovered the
nodes by completely stopping the offending application.
Is there any way to recover from this segment merge failure? This
particular "_25fy9" segment seems to be the only failed segment. Is there any
way to start this shard fresh, even if it means losing the data in this
particular segment?

Cheers
Nitish

On Friday, August 17, 2012 4:15:41 AM UTC+2, Igor Motov wrote:

Yes, I can see how constantly trying to merge segments and failing at
it can cause abnormal I/O load. Has this cluster ever run out of disk space
or memory while it was indexing?

On Thursday, August 16, 2012 12:51:40 PM UTC-4, Nitish Sharma wrote:

More information - On these 2 particular nodes, we continuously get
these warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
    at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
    at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings relate to shard number 3, and both copies of that shard reside
on these 2 nodes. Could the abnormal I/O load on these 2 nodes be caused by
corruption of shard 3?

Cheers
Nitish
On Thursday, August 16, 2012 5:00:14 PM UTC+2, Nitish Sharma wrote:

Hi Kimchy, Igor
Considering all this high load on 1 particular node, we added 5 more
nodes to our cluster, thinking that the load would be distributed. We also
stopped all update operations. After running stable for about 1 week, today
suddenly 2 out of 10 nodes started acting up. They have exceptionally high
I/O wait time and thus high load, which in turn increases query execution
time. Note that we are not doing any update operations; only simple indexing.

Jstack of a node with normal load: http://pastebin.com/vYmE8dZe
Jstack of the node with high IO load: http://pastebin.com/5xUeBqZU
Jstack of another node with high IO load: http://pastebin.com/Rafi3Fbk

All nodes run Java 7, Ubuntu 12.04 and JVM 23.1-b03.
It would be great if some pointers can be provided to track down the
problem.

Cheers
Nitish

On Thursday, August 2, 2012 7:16:32 PM UTC+2, Nitish Sharma wrote:

Hi Kimchy,
I just updated all nodes to use Java 7; though still using Ubuntu
10.04.
Java version: 1.7.0_05
JVM: 23.1-b03

The problems still persist:

  • One of the nodes is still using a lot of CPU and garbage-collecting
    its heap almost every minute.
  • Bigdesk shows that 1 node is not receiving any GET requests (we
    have continuous update operations going on).

Any more suggestions? :confused:

On Thursday, August 2, 2012 2:14:42 PM UTC+2, kimchy wrote:

Can you try and upgrade to a newer JVM? The one you use is pretty
old. If you want to use 1.6, then make sure it's a recent update (like
update 33), and make sure you have the same JVM across all nodes.

Also, if you are up for it, a newer Ubuntu version (there is a new
LTS as well) is recommended.
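
A quick way to verify the JVM really is identical across nodes (the host list
and ssh loop here are just illustrative):

    for h in node1 node2 node3 node4 node5; do
      echo -n "$h: "; ssh "$h" 'java -version 2>&1 | head -1'
    done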

On Jul 31, 2012, at 2:31 PM, Nitish Sharma sharmani...@gmail.com
wrote:

Hi Kimchy,
Following are the jstacks:
ES Node1 - CPU Usage 100-200%: https://gist.github.com/3216175
ES Node2 (offending node) - CPU Usage 600-700%: https://gist.github.com/3216198
ES Node3 - CPU Usage 100-200%: https://gist.github.com/3216200

ES Version: 0.19.8
OS: Ubuntu 10.04.4 LTS
JVM Versions: 20.0-b12 and 19.0-b09
All nodes are physical machines with 24 GB RAM, 8-core CPU.

Another observation - Bigdesk shows that there are no GET requests
coming to node1, which is kind of weird since HAProxy balances all requests
in round-robin fashion.
The problem is not just CPU usage of node2 but also heap memory
usage. Because of excessive and fast heap memory usage, GCs run so often that
node2 heavily skews our search performance.
Following are the heap graphs:
Node1: node1_heap - Imgur
Node2: node2_heap - Imgur
Node3: node3_heap - Imgur
On Tuesday, July 31, 2012 7:44:37 AM UTC+2, kimchy wrote:

Can you jstack another node? Let's see if it's doing any work as
well. Which ES version are you using? Also, which JVM version and OS version,
and are you running in a virtual env or not?

On Jul 31, 2012, at 1:47 AM, Nitish Sharma sharmani...@gmail.com
wrote:

We are using Tire Ruby client. The ES cluster is behind HAProxy.
Thus, all search, get, and update requests are (almost) equally distributed
across all nodes.

On Monday, July 30, 2012 4:39:29 PM UTC+2, Stéphane R. wrote:

Hi,

What kind of clients are you using? Do they balance their queries
between the five nodes or do they always query the same one? If they do
so, it may explain this kind of behavior.

Best,

Stéphane

2012/7/30 Nitish Sharma sharmani...@gmail.com:

Hi Igor,
I checked the stats, and elasticsearch-head also confirmed that each node has
an equal number of shards. Moreover, interestingly, this weekend this behaviour
(of constant high CPU usage) was taken over by another node, and the node
previously over-using CPU is now more or less normal. So, as far as I
observed it, at any given point in time (at least) 1 node would be doing a
lot of pure CPU work, while the other nodes are fairly quiet. Weird!
We are not indexing documents with routing, nor updating them using
routes.
Any other pointers?

Cheers
Nitish

On Saturday, July 28, 2012 12:47:34 AM UTC+2, Igor Motov
wrote:

Interesting. Did you try running curl
"localhost:9200/_nodes/stats?pretty=true" to make sure that uniform
distribution of indexing operations is really the case?
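
One way to eyeball this from the stats output (a sketch; the exact JSON field
layout differs between ES versions, so the grep pattern below is illustrative):

    # dump per-node stats and compare the indexing counters node by node
    curl -s "localhost:9200/_nodes/stats?pretty=true" > nodes_stats.json
    grep -E '"name"|"index_total"' nodes_stats.json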

On Friday, July 27, 2012 6:15:29 PM UTC-4, Nitish Sharma
wrote:

We are, indeed, running a lot of "update" operations
continuously but
they are not routed to specific shards. The document to be
updated can be
present on any of the shards (on any of the nodes). And, as
I mentioned, all
shards are uniformly distributed across nodes.

On Friday, July 27, 2012 10:12:56 PM UTC+2, Igor Motov
wrote:

It looks like this node is quite busy updating documents.
Is it possible
that your indexing load is concentrated on the shards that
just happened to
be located on this particular node?

On Friday, July 27, 2012 3:58:46 PM UTC-4, Nitish Sharma
wrote:

Hi Igor,
I couldn't make any sense out of the jstack dump (2000
lines long).
Maybe you can help: http://pastebin.com/u57QB7ra

Cheers
Nitish

On Friday, July 27, 2012 6:04:18 PM UTC+2, Igor Motov
wrote:

Run jstack on the node that is using 600-700% of CPU and let's see
what it's doing.
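
If it helps, this is the usual way to capture the dump and correlate the hot
threads with it (<pid> and <thread_id> are placeholders):

    jps -l                          # find the elasticsearch pid
    jstack <pid> > es_hot_node.jstack
    top -H -p <pid>                 # per-thread CPU; note the busiest thread ids
    printf 'nid=0x%x\n' <thread_id> # hex id to search for in the jstack output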

On Friday, July 27, 2012 9:45:27 AM UTC-4, Nitish Sharma
wrote:

Hi,
We have a 5-node ES cluster. On one particular node the ES process is
consuming 600-700% CPU (8 cores) all the time, while the other nodes' CPU
usage is always below 100%. We are running 0.19.8 and each node has an
equal number of shards.
Any suggestions?

Cheers
Nitish

--

Hello Nitish,

Do you have any update on how you solved this problem? Was it a corrupt
index?

We have a similar problem here. One index causes GC + CPU load of
700%. When the index is rebalanced to another node, the load moves to that
node. What did you do to fix this?

Thanks,
Thibaut


--

Sebastian, I tried CheckIndex and found it quite useful for corrupted
shards.

--

Hi Thibaut,
For us it was indeed a corrupted shard, and the continuous merge attempts (and
failures) for that shard were doing a lot of I/O, hence the high load on the
node(s) where the corrupted shard was assigned.
The symptoms you describe here also look similar. I would suggest checking the
logs on the node experiencing high load and looking for messages that
suggest corruption of a particular shard.
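
Something along these lines should surface them (the log path is the
Debian/Ubuntu package default and is just illustrative; "failed to merge" and
EOFException are the strings from the warnings quoted earlier in this thread,
and CorruptIndexException is another corruption-related exception worth
looking for):

    grep -iE 'failed to merge|EOFException|CorruptIndexException' /var/log/elasticsearch/*.log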


--

Can you share the ES version you got the corrupt index with? Not the current one that you use, but the one that was used when the failure happened (like running out of file desc, or out of memory).


--

Hi Kimchy,
We got a couple of out-of-memory errors on 0.19.4, after which the shards got
corrupted. Currently, we operate on 0.19.8.


--

Hi Shay,
We recently upgraded to 0.19.10 so that we can use hot_threads to
investigate occasional high loads on our ES cluster. But from the very
next day after the upgrade, we again started getting merge failure
warnings, resulting in high I/O load.
I've tried deleting the faulty segment from one node to force it to resync the
segment from the other node. But it looks like this particular segment
is failing to merge on both nodes. Following is the gist from both nodes:


This time we didn't encounter any out-of-memory or similar error. I also
read about some folks facing persistent high CPU usage because of
compatibility issues between netty and Java 7. But our load results
from pure I/O (no high CPU usage), and it goes up only when the merge thread
fires up.
I don't want to run CheckIndex, since that will just delete this segment and
we would end up losing the data. Any suggestions? I know that for these kinds
of problems IRC is a better place and it's faster to debug things there, but
it's kinda difficult to get hold of you there.

Cheers
Nitish
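
For reference, the hot_threads API mentioned above can be queried directly; the
node name and the parameters shown here are illustrative and may vary slightly
between versions:

    curl "localhost:9200/_nodes/hot_threads"
    curl "localhost:9200/_nodes/<node_name>/hot_threads?threads=3&interval=500ms"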

--