We're now fairly sure the OOM was the main culprit; here are the earliest
logs pointing out the trouble. The shard 0 corruption seems to be immediate:
[2013-12-15 23:46:15,379][WARN
][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the
selector loop.
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:13,930][WARN
][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the
selector loop.
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:48,691][WARN ][index.engine.robin ] [Centurious?]
[zapier_legacy][0] failed engine
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:48,690][WARN ][index.merge.scheduler ] [Centurious?]
[zapier_legacy][6] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:52,573][DEBUG][action.index ] [Centurious?]
[zapier_legacy][0], node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED]: Failed
to execute [index {...}]
org.elasticsearch.index.engine.IndexFailedEngineException:
[zapier_legacy][0] Index failed for
[write#599581e0-e0a9-4017-a961-75bbbaeff4c4]
[2013-12-15 23:46:52,630][WARN ][index.shard.service ] [Centurious?]
[zapier_legacy][0] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException:
[zapier_legacy][0] Refresh failed
[2013-12-15 23:46:52,933][WARN ][cluster.action.shard ] [Centurious?]
[zapier_legacy][0] sending failed shard for
[zapier_legacy][0], node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED],
indexUUID [pzWL-WO_SsaGbuWfn2IQaw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2013-12-15 23:46:59,249][WARN ][indices.cluster ] [Centurious?]
[zapier_legacy][0] master [[Hera][_WWQvOUIR
Km5F-A822JsKw][inet[/10.211.7.15:9300]]{aws_availability_zone=us-east-1c}]
marked shard as started, but shard has not
been created, mark shard as failed
[2013-12-15 23:46:59,249][WARN ][cluster.action.shard ] [Centurious?]
[zapier_legacy][0] sending failed shard for [zapier_legacy][0],
node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED], indexUUID
[pzWL-WO_SsaGbuWfn2IQaw], reason [master
[Hera][_WWQvOUIRKm5F-A822JsKw][inet[/10.211.7.15:9300]]{aws_availability_zone=us-east-1c}
marked shard as started, but shard has not been created, mark shard as
failed]
If we restart a single node with index.shard.check_on_startup: true,
would that suffice, or does it need to be a full cluster restart?
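For reference, a sketch of the single-node variant (the config path and service name are assumptions about a typical package install; the setting is per-index but evaluated when a shard starts on a node):

```shell
# Sketch: enable the startup check on one node, then restart only that node.
# Paths and service name are assumptions (Debian/RPM-style install).
echo 'index.shard.check_on_startup: true' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo service elasticsearch restart
# Watch the node's log for CheckIndex output while its shards recover.
sudo tail -f /var/log/elasticsearch/*.log
```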
On Tuesday, December 17, 2013 2:37:36 PM UTC-8, Alexander Reelsen wrote:
Hey,
na, sent too early. What OOM exception were you hitting? Was it due to
querying your data? Just trying to make sure there was nothing
out of the ordinary that triggered the corruption.
--Alex
On Tue, Dec 17, 2013 at 11:36 PM, Alexander Reelsen <a...@spinscale.de>
wrote:
Hey,
somehow your index data is corrupt. You could set
'index.shard.check_on_startup' to true and check its output. This triggers
a Lucene CheckIndex, which may rewrite your segments file if it runs
successfully.
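If it helps, CheckIndex can also be run directly against a shard's Lucene directory with the node shut down (the jar and data paths below are assumptions about a standard install; run without -fix first, since -fix drops segments it cannot read):

```shell
# Dry run first: report corruption without modifying anything.
# Jar location and data path are assumptions; <cluster_name> is your cluster's name.
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  /var/lib/elasticsearch/<cluster_name>/nodes/0/indices/zapier_legacy/0/index
# Only after reviewing the report, and with a backup in place, consider
# adding -fix, which removes unreadable segments (data loss is possible).
```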
--Alex
On Tue, Dec 17, 2013 at 8:25 PM, Bryan Helmig <br...@zapier.com>
wrote:
We're at "max_file_descriptors" : 65535 right now, and I haven't seen
anything around file handles in the logs (and we have plenty of disk
space). We did have an OOM exception due to a misconfigured heap a few days
ago, but we did a rolling restart after a fix and it all seemed fine.
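For the record, both numbers can be confirmed over the HTTP API rather than from the logs (assuming a node listening on localhost:9200):

```shell
# Max file descriptors as seen by each node's JVM process.
curl -s 'localhost:9200/_nodes/process?pretty'
# Configured heap per node; a mis-set -Xmx on one node would show up here.
curl -s 'localhost:9200/_nodes/jvm?pretty'
```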
gist:091839e6a48a4e103699 has the full
log with the exception repeated over and over.
gist:3c17edfe5c4e9065e5a3 was the first
log that had an error, with a few other interesting lines like:
"MergeException[java.io.EOFException: read past EOF: NIOFSIndexInput ..."
On Tuesday, December 17, 2013 11:15:36 AM UTC-8, Alexander Reelsen wrote:
Hey,
is it possible that there is actually an exception happening before the
NativeFSLock exception occurred? Running out of disk space or file handles,
or something like that?
--Alex
On Tue, Dec 17, 2013 at 8:02 PM, Bryan Helmig br...@zapier.com wrote:
Hey Alex!
- We're using EBS.
- JVM is 1.7.0_25 across all nodes.
- Elasticsearch is 0.90.7 across all nodes.
Right now the cluster is stable, albeit with one shard with no
primaries/replicas started (one is stuck in a never-ending recovery and one
is unassigned). It's weird, because for a short period of time last night it
had a primary (with a broken replica, though) and was growing in size. That
good fortune has not returned...
-bryan
On Tuesday, December 17, 2013 8:16:36 AM UTC-8, Alexander Reelsen
wrote:
Hey,
is the filesystem you use to store data a network file system or just a
normal one? If not a standard one, any special file system type?
Do you have an up-to-date JVM version?
Do you have an up-to-date elasticsearch version?
--Alex
On Tue, Dec 17, 2013 at 6:42 AM, Bryan Helmig <br...@zapier.com> wrote:
It looks like we're back to not having a good primary here anymore;
both shards are either RECOVERING or UNASSIGNED.
(Sorry for the constant stream of updates, just trying to get to the
bottom of this one.)
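A sketch of how to watch that shard-level state from the API (the host:port is an assumption):

```shell
# Per-shard state: STARTED / RELOCATING / INITIALIZING / UNASSIGNED.
curl -s 'localhost:9200/_cluster/health?level=shards&pretty'
# Full routing table for more detail on where each shard copy sits.
curl -s 'localhost:9200/_cluster/state?pretty'
```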
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/b4f88f23-1b51-45e9-afec-dcf0efa2c2fd%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.