Sebastian - in Lucene land (and Solr, too) people run CheckIndex normally -
http://search-lucene.com/?q=%2B"CheckIndex"&sort=newestOnTop&fc_type=mail+_hash_+user
Otis
On Wednesday, September 19, 2012 2:31:58 AM UTC-4, Sebastian Lehn wrote:
Has anybody tried to repair it as Igor advised?
On Tuesday, August 21, 2012 7:04:19 PM UTC+2, Igor Motov wrote:
I would try to shut down ES, back up all files in the shard index directory, and run the Lucene CheckIndex tool there. I never had to run it on elasticsearch indices, but since they are Lucene indices, it might just work.
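For anyone trying this, here is a rough sketch of what that might look like. The jar location and name are assumptions (ES 0.19.x ships a lucene-core 3.6.x jar in its lib directory, but the exact path depends on your install), and the shard path is the one from the merge error later in this thread. Run the read-only check first; -fix removes unreadable segments, and the documents in them are lost.

  # stop elasticsearch on the node, then back up the shard's index directory
  cp -a /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index \
        /root/shard3-index-backup

  # read-only check first, using the lucene-core jar that ships with ES
  java -cp /usr/share/elasticsearch/lib/lucene-core-3.6.0.jar \
       org.apache.lucene.index.CheckIndex \
       /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index

  # only if you accept losing the documents in the broken segment(s),
  # re-run the same command with -fix appended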
On Tuesday, August 21, 2012 11:37:00 AM UTC-4, Nitish Sharma wrote:
Anyone got an idea about how to recover the shard?
On Friday, August 17, 2012 1:06:00 PM UTC+2, Nitish Sharma wrote:
Yeah, at some point a couple of nodes ran out of memory. We recovered the nodes by completely stopping the offending application.
Is there any way to recover from this segment merge failure? This particular "_25fy9" segment seems to be the only failed segment. Any way to start this shard fresh, even if it means losing the data in this particular segment?

Cheers
Nitish

On Friday, August 17, 2012 4:15:41 AM UTC+2, Igor Motov wrote:
Yes, I can see how constantly trying to merge segments and failing at it can cause abnormal I/O load. Has this cluster ever run out of disk space or memory while it was indexing?

On Thursday, August 16, 2012 12:51:40 PM UTC-4, Nitish Sharma wrote:
More information - on these 2 particular nodes, we continuously get these warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
        at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
        at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
        at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
        at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings are related to shard number 3, and both copies of it reside on these 2 nodes. Could it be that the abnormal behaviour (IO load) of these 2 nodes is because of corruption of shard 3?

Cheers
Nitish
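Since the exception points at a specific stored-fields file, one quick thing to look at (just a hedged suggestion) is whether that file is truncated. The path comes from the log line above; the expectation that a healthy _25zy9.fdt should not be unusually small is an assumption on my part.

  # list all files belonging to the failing segment; a zero-length or
  # unexpectedly small _25zy9.fdt would explain "read past EOF" during merge
  ls -l /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.*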
On Thursday, August 16, 2012 5:00:14 PM UTC+2, Nitish Sharma wrote:

Hi Kimchy, Igor
Considering all this high load on 1 particular node, we added 5 more nodes to our cluster, thinking that the load would be distributed. We also stopped all update operations. After running stable for about 1 week, today suddenly 2 out of 10 nodes started acting up. They have exceptionally high IO wait time and thus high load, and subsequently increasing query execution times. Note that we are not doing any update operations; only simple indexing.

Jstack of a node with normal load: http://pastebin.com/vYmE8dZe
Jstack of the node with high IO load: http://pastebin.com/5xUeBqZU
Jstack of another node with high IO load: (Pastebin link: full thread dump, Java HotSpot(TM) 64-Bit Server VM 23.1-b03)

All nodes run Java 7, Ubuntu 12.04 and JVM 23.1-b03.
It would be great if some pointers could be provided to track down the problem.

Cheers
Nitish
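For the IO-wait symptom, a rough, generic way to confirm where the time goes (assuming the sysstat and iotop packages are installed; nothing here is ES-specific):

  # per-device utilisation, queue size, and await, refreshed every 5 seconds
  iostat -x 5

  # which processes/threads are actually doing the reads and writes right now
  sudo iotop -o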
On Thursday, August 2, 2012 7:16:32 PM UTC+2, Nitish Sharma wrote:
Hi Kimchy,
I just updated all nodes to use Java 7; though still using Ubuntu
10.04.
Java version: 1.7.0_05
JVM: 23.1-b03

The problems still persist:
- 1 of the nodes is still using a lot of CPU and garbage collecting heap memory almost every minute.
- Bigdesk shows that 1 node is not receiving any GET requests (we have continuous update operations going on).

Any more suggestions?
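To put numbers on the "GC almost every minute" observation, something like the following could help; jstat ships with the JDK, the pgrep match is a rough heuristic for finding the ES process, and the 5-second sampling interval is arbitrary:

  # find the elasticsearch java process (rough heuristic)
  ES_PID=$(pgrep -f elasticsearch | head -1)

  # heap occupancy per generation plus young/full GC counts and times, every 5s
  jstat -gcutil "$ES_PID" 5s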
On Thursday, August 2, 2012 2:14:42 PM UTC+2, kimchy wrote:
Can you try and upgrade to a newer JVM? The ones you use are pretty old. If you want to use 1.6, then make sure it's a recent update (like update 33), and make sure you have the same JVM across all nodes.
Also, if you are up for it, a newer Ubuntu version (there is a new LTS as well) is recommended.
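A quick way to double-check the "same JVM across all nodes" part; the hostnames below are placeholders for your own nodes:

  for h in node1 node2 node3 node4 node5; do
    echo "== $h =="
    # java -version prints to stderr, hence the redirect
    ssh "$h" 'java -version' 2>&1 | head -n 2
  done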
On Jul 31, 2012, at 2:31 PM, Nitish Sharma sharmani...@gmail.com wrote:

Hi Kimchy,
Following are the jstacks:
ES Node1 - CPU Usage 100-200%: ES Node1 jstack · GitHub
ES Node2 (offending node) - CPU Usage 600-700%:
ES Node2 jstack · GitHub
ES Node3 - CPU Usage 100-200%: ES Node3 jstack · GitHub

ES Version: 0.19.8
OS: Ubuntu 10.04.4 LTS
JVM Versions: 20.0-b12 and 19.0-b09
All nodes are physical machines with 24 GB RAM and 8-core CPUs.

Another observation - Bigdesk shows that there are no GET requests coming to node1, which is kind of weird since HAProxy balances all requests in round-robin fashion.
The problem is not just the CPU usage of node2 but also its heap memory usage. Because of excessive and fast heap memory usage, GCs happen so often that node2 heavily skews our search performance.
Following are the heap graphs:
Node1: node1_heap - Imgur
Node2: node2_heap - Imgur
Node3: node3_heap - Imgur
On Tuesday, July 31, 2012 7:44:37 AM UTC+2, kimchy wrote:

Can you jstack another node? Let's see if it's doing any work as well. Which ES version are you using? Also, JVM version, OS version, and are you running in a virtual env or not?

On Jul 31, 2012, at 1:47 AM, Nitish Sharma sharmani...@gmail.com wrote:

We are using the Tire Ruby client. The ES cluster is behind HAProxy. Thus, all search, get, and update requests are (almost) equally distributed across all nodes.

On Monday, July 30, 2012 4:39:29 PM UTC+2, Stéphane R. wrote:
Hi,
What kind of clients are you using? Do they balance their queries between the five nodes, or do they always query the same one? If they do so, it may explain this kind of behavior.

Best,
Stéphane
2012/7/30 Nitish Sharma sharmani...@gmail.com:
Hi Igor,
I checked the stats, and elasticsearch-head also confirmed that each node has an equal number of shards. Moreover, interestingly, this weekend this behaviour (of constant high CPU usage) was taken over by another node, and the node previously over-using CPU is now more or less normal. So, as far as I observed it, at any given point in time (at least) 1 node would be doing a lot of pure CPU work, while the other nodes are fairly quiet. Weird!
We are not indexing documents with routing, nor updating them using routes.
Any other pointers?

Cheers
Nitish

On Saturday, July 28, 2012 12:47:34 AM UTC+2, Igor Motov wrote:
Interesting. Did you try to run curl "localhost:9200/_nodes/stats?pretty=true" to make sure that uniform distribution of indexing operations is really the case?
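If it helps, a rough way to eyeball that output; the exact field names in the 0.19 stats response are an assumption on my part ("index_total" in particular), so adjust the grep to whatever your version actually returns:

  curl -s "localhost:9200/_nodes/stats?pretty=true" > nodes_stats.json

  # compare per-node indexing counters; wildly different numbers would mean
  # the indexing load is not evenly spread across the nodes
  grep -E '"name"|"index_total"' nodes_stats.json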
On Friday, July 27, 2012 6:15:29 PM UTC-4, Nitish Sharma wrote:

We are, indeed, running a lot of "update" operations
continuously but
they are not routed to specific shards. The document to be
updated can be
present on any of the shards (on any of the nodes). And, as
I mentioned, all
shards are uniformly distributed across nodes.

On Friday, July 27, 2012 10:12:56 PM UTC+2, Igor Motov wrote:
It looks like this node is quite busy updating documents. Is it possible that your indexing load is concentrated on the shards that just happened to be located on this particular node?
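One hedged way to check that: if I remember the pre-1.x API right, the indices status endpoint lists per-shard document counts, sizes, and which node holds each copy, so a few much larger or busier shards would show up there. The index name is the one mentioned elsewhere in this thread; treat the endpoint itself as an assumption for your version.

  # per-shard state, doc counts, and sizes; uneven shards would support the
  # "load concentrated on a few shards" theory
  curl "localhost:9200/rolling_index/_status?pretty=true"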
On Friday, July 27, 2012 3:58:46 PM UTC-4, Nitish Sharma wrote:

Hi Igor,
I couldn't make any sense out of the jstack dump (2000 lines long). Maybe you can help - http://pastebin.com/u57QB7ra?

Cheers
Nitish
On Friday, July 27, 2012 6:04:18 PM UTC+2, Igor Motov wrote:

Run jstack on the node that is using 600-700% of CPU and let's see what it's doing.
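For anyone following along, a sketch of that procedure; the pgrep match is a rough heuristic and the TID value is a placeholder to replace with a real thread id from the top output:

  # find the ES process and its busiest threads
  ES_PID=$(pgrep -f elasticsearch | head -1)
  top -H -b -n 1 -p "$ES_PID" | head -20

  # take a full thread dump
  jstack "$ES_PID" > jstack.txt

  # top shows decimal thread ids, jstack shows hex "nid" values; convert to match them up
  TID=12345   # replace with a hot thread id taken from the top output
  printf 'nid=0x%x\n' "$TID"

  # quick overview of the dump: thread counts by state
  grep -o 'java.lang.Thread.State: [A-Z_]*' jstack.txt | sort | uniq -c | sort -rn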
On Friday, July 27, 2012 9:45:27 AM UTC-4, Nitish Sharma wrote:

Hi,
We have a 5-node ES cluster. On one particular node, the ES process is consuming 600-700% CPU (8 cores) all the time, while the other nodes' CPU usage is always below 100%. We are running 0.19.8 and each node has an equal number of shards.
Any suggestions?

Cheers
Nitish