Could not lock IndexWriter isLocked [false] org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock

Here are some logs from the start of the incident:

gist 3c17edfe5c4e9065e5a3 (GitHub)

And basically these logs over and over:

gist cfb9303bc033a1183701 (GitHub)

A little background:

The cluster is 3 nodes on AWS & EBS, 100 shards (50 primaries & 50 replicas), and just this single shard (so far) got corrupted (?). We're at about 800 GB of data, and we're using routing keys to keep it all (mostly) sane among shards. Here is the topology of the cluster from ES Head:

http://i.imgur.com/zJa9Beh.png

I think it happened as it tried to relocate a shard. Now it refuses to
start the engine?

Thanks!
-bryan


We've restarted that node and it seemed to be working its way back to
normality...

But the LockReleaseFailedException is here to stay:

[2013-12-17 04:43:53,962][WARN ][cluster.action.shard ] [Porcupine] [zapier_legacy][0] sending failed shard for [zapier_legacy][0], node[QToCnTWtQLCWySMnbjm2IQ], [P], s[INITIALIZING], indexUUID [pzWL-WO_SsaGbuWfn2IQaw], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[zapier_legacy][0] failed recovery]; nested: EngineCreationFailureException[[zapier_legacy][0] failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/write.lock]; ]]

Any thoughts?


We also tried taking the box down, nuking that leftover write.lock, and
bringing it back up (to the same effect).
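
For anyone following along, a hedged way to locate leftover Lucene lock files like that one, and only with the node fully stopped, might be something like this (the path is just the data directory from the log above; adjust for your install):

sudo find /var/data/elasticsearch -name write.lock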


Actually, after shutting down the troubled node and bringing it up again,
we've got a primary for the troubled shard. Now the other nodes are refusing
to take the replica, perhaps due to something around the write.lock as well,
since they are getting the same error. The cluster just seems to keep
shuffling the forever-failing "recovering" replica around, attempting to get
it going.


It looks like we're back to not having a good primary here anymore; both
copies of the shard are either RECOVERING or UNASSIGNED.

(Sorry for the constant stream of updates, just trying to get to the bottom
of this one.)
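
For reference, a quick way to see these per-shard states from the cluster itself, assuming a node is reachable on the default HTTP port, is the cluster health API with shard-level detail:

curl -s 'http://localhost:9200/_cluster/health?level=shards&pretty'

That lists, for each shard number, how many copies are active, initializing (recovering), relocating, or unassigned, which makes the one stuck shard easy to spot.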


Hey,

Is the filesystem you use to store the data a network filesystem or just a
normal local one? If it's local, is it any special filesystem type?
Do you have an up-to-date JVM version?
Do you have an up-to-date Elasticsearch version?

--Alex


Hey Alex!

  1. We're using EBS.
  2. JVM is 1.7.0_25 across all nodes.
  3. Elasticsearch is 0.90.7 across all nodes.

Right now the cluster is stable, albeit with one shard that has neither its
primary nor its replica started (one is in a never-ending recovery and one is
unassigned). It's weird because for a short period last night it had a
primary (with a broken replica, though) and was growing in size. That good
fortune has not returned...

-bryan
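
For reference, a quick way to double-check those versions on each node, assuming the default HTTP port, is:

curl -s 'http://localhost:9200/'   # the root endpoint reports that node's Elasticsearch version
java -version                      # reports the JVM installed on that box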


Hey,

Is it possible that an exception actually happened before the NativeFSLock
exception occurred? Running out of disk space or file handles, or something
like that?

--Alex


BTW, nothing particularly unique on top of EBS for the filesystem. Pretty
vanilla.


We're at "max_file_descriptors" : 65535 right now, and I haven't seen
anything around file handles in the logs (and we have plenty of disk
space). We did have an OOM exception due to a misconfigured heap a few days
ago, but we did a rolling restart after a fix and it all seemed fine.

Gist 091839e6a48a4e103699 (GitHub) has the full log with the exception
repeated over and over. Gist 3c17edfe5c4e9065e5a3 (GitHub) was the first log
that had an error, and it has a few other interesting lines like:

"MergeException[java.io.EOFException: read past EOF: NIOFSIndexInput ..."


Hey,

Somehow your index data is corrupt. You could set
'index.shard.check_on_startup' to true and check its output. This triggers
a Lucene CheckIndex, which might rewrite your segments file if it runs
successfully.

--Alex
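
A minimal sketch of what that can look like, assuming the setting is picked up from elasticsearch.yml when the node is (re)started (the exact config file location depends on your install):

# in elasticsearch.yml
index.shard.check_on_startup: true   # run a Lucene CheckIndex as each shard starts; the "fix" mode mentioned later in the thread also drops broken segments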


Hey,

Nah, sent too early. What OOM exception were you hitting? Was it due to
querying your data? Just trying to make sure there was nothing out of the
ordinary that triggered the corruption.

--Alex


We're now fairly sure the OOM was the main culprit; here are the earliest
logs pointing out the trouble. The shard 0 corruption appears to have been immediate:

[2013-12-15 23:46:15,379][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:13,930][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:48,691][WARN ][index.engine.robin ] [Centurious?] [zapier_legacy][0] failed engine
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:48,690][WARN ][index.merge.scheduler ] [Centurious?] [zapier_legacy][6] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2013-12-15 23:46:52,573][DEBUG][action.index ] [Centurious?] [zapier_legacy][0], node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED]: Failed to execute [index {...}]
org.elasticsearch.index.engine.IndexFailedEngineException: [zapier_legacy][0] Index failed for [write#599581e0-e0a9-4017-a961-75bbbaeff4c4]
[2013-12-15 23:46:52,630][WARN ][index.shard.service ] [Centurious?] [zapier_legacy][0] Failed to perform scheduled engine refresh
org.elasticsearch.index.engine.RefreshFailedEngineException: [zapier_legacy][0] Refresh failed
[2013-12-15 23:46:52,933][WARN ][cluster.action.shard ] [Centurious?] [zapier_legacy][0] sending failed shard for [zapier_legacy][0], node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED], indexUUID [pzWL-WO_SsaGbuWfn2IQaw], reason [engine failure, message [OutOfMemoryError[Java heap space]]]
[2013-12-15 23:46:59,249][WARN ][indices.cluster ] [Centurious?] [zapier_legacy][0] master [[Hera][_WWQvOUIRKm5F-A822JsKw][inet[/10.211.7.15:9300]]{aws_availability_zone=us-east-1c}] marked shard as started, but shard has not been created, mark shard as failed
[2013-12-15 23:46:59,249][WARN ][cluster.action.shard ] [Centurious?] [zapier_legacy][0] sending failed shard for [zapier_legacy][0], node[tRG516VtQ5OdU8eNpga44g], [P], s[STARTED], indexUUID [pzWL-WO_SsaGbuWfn2IQaw], reason [master [Hera][_WWQvOUIRKm5F-A822JsKw][inet[/10.211.7.15:9300]]{aws_availability_zone=us-east-1c} marked shard as started, but shard has not been created, mark shard as failed]

If we restart a single node with index.shard.check_on_startup: true, would
that suffice, or does it need to be a full-cluster restart?
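
Separately, since the misconfigured heap looks like the root cause, a minimal sketch of pinning the heap on a 0.90-era install, assuming the packaged start scripts honor ES_HEAP_SIZE (the file path and the 4g value are illustrative; the usual guidance at the time was roughly half the machine's RAM):

# /etc/default/elasticsearch (or wherever your init script reads its environment)
ES_HEAP_SIZE=4g   # sets both -Xms and -Xmx for the Elasticsearch JVM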


It looks like we might be able to try CheckIndex directly (à la an earlier
thread on this group), or perhaps that is the same thing as
"index.shard.check_on_startup: true"...


So, running on all nodes:

java -cp :/usr/share/elasticsearch/lib/elasticsearch-0.90.7.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* \
  -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex \
  "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/"

Gives us the same result:

WARNING: 1 broken segments (containing 4444 documents) detected
WARNING: would write new segments file, and 4444 documents would be lost,
if -fix were specified

Now we're trying to work out the proper order of operations to fix this
(i.e., can CheckIndex be run against a live node, or should we shut one node
down first, apply the fix, and bring it back, etc.).


I know this exception from OOMs, too, when the heap got low.

You should identify the corrupted shard and make a filesystem copy of it so
you do not lose files.

I cannot recommend Lucene CheckIndex, because ES uses a modified Lucene 4
index and may not be able to simply pick up an index "repaired" by Lucene
(the "repair" amounts to dropping docs).

I still have to test whether "index.shard.check_on_startup: fix" works at
all; it was back in the Lucene 3.6 days that it worked quite OK, and a lot
has changed since then.

Jörg


It seems we've definitely found the corrupted shard (shard 0 is corrupted in
the same way across all nodes; all other shards seem to check out fine).

Is it worth making a filesystem backup first and trying the vanilla
CheckIndex -fix, or should we wait for your "index.shard.check_on_startup:
fix" test? Also, can we assume that if one node is restarted with the fixed
shard, the other nodes will replicate from it?
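
For what it's worth, a hedged sketch of the order of operations being discussed, assuming CheckIndex is only ever run while the node holding that shard copy is stopped, and reusing the paths and classpath from the command earlier in the thread (the service commands and backup destination are illustrative):

# 1. stop Elasticsearch on the node whose copy of shard 0 you want to repair
sudo service elasticsearch stop

# 2. back up the shard directory first, per Jörg's advice
sudo cp -a "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0" /backups/zapier_legacy_0

# 3. re-run CheckIndex with -fix (this is what drops the broken segments and their documents)
java -cp :/usr/share/elasticsearch/lib/elasticsearch-0.90.7.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* \
  -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex \
  "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/" -fix

# 4. start the node again and watch whether shard 0 recovers and the replica follows
sudo service elasticsearch start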


Hm, just wanted to clarify that I'm not familiar with how
"index.shard.check_on_startup: fix" behaves on the latest ES with Lucene 4.

Even if I can test it, there is no guarantee that it works for you.
Different systems, different index, different corruptions... who knows.

I'm quite puzzled, though: you don't have a replica shard? CheckIndex is
really a last resort when there are no replicas, and it is not the preferred
method of ensuring data integrity in ES...

Jörg


All copies of the shard have the same corruption, it seems. We can't get a
primary up for shard 0, so the replica never comes up either. Does that make sense?


We're also fine with losing a few docs, as we can reindex them from another
source, so dropping the documents works for us.
