Could not lock IndexWriter isLocked [false] org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock

So, a little more digging, and it looks like the process was still holding onto a write.lock file that was already gone.

sudo lsof -uelasticsearch | grep 'legacy/0'
java 27517 elasticsearch 1042uW REG 202,1 0 525279 /var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/write.lock (deleted)
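
For anyone else digging into this, two other ways to spot the same situation (the user and the pid below are just the ones from the lsof output above):

# files the elasticsearch user still holds open but which have been unlinked (link count 0):
sudo lsof -u elasticsearch -a +L1
# or look straight at the process's file descriptors (27517 is the pid from above):
sudo ls -l /proc/27517/fd | grep deleted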

We did delete some leftover lock files after the nodes powered down, but that shouldn't have caused this (unless we made a mistake and nuked one on a live instance). Somehow that plus the OOM corruption led to a pretty crazy situation. We're almost back from it after some restarts; we should be able to put up a blog post on the situation afterwards. I'll follow up with results and a link ASAP.
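
In case anyone repeats this, the safer version of that cleanup is roughly the following; the service name and data path are just the ones from our setup, and the point is to only touch write.lock once the node process is definitely gone:

sudo service elasticsearch stop    # or however the node is managed
# only proceed if no ES java process is left for the user:
pgrep -u elasticsearch java || \
  find "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices" -name write.lock
# review the list, delete the stale locks, then start the node again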

On Tuesday, December 17, 2013 8:13:39 PM UTC-8, Bryan Helmig wrote:

We're also fine with losing a few docs, as we can reindex them from another source, so dropping the documents works for us.

On Tuesday, December 17, 2013 7:47:21 PM UTC-8, Bryan Helmig wrote:

All replicas seem to have the same corruption. We can't get a primary up for shard 0, so the replica never comes up. Does that make sense?
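
(If it helps anyone following along, shard-level cluster health shows exactly which copies are stuck; the index name here is just ours:)

curl 'localhost:9200/_cluster/health/zapier_legacy?level=shards&pretty'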

On Tuesday, December 17, 2013 6:33:19 PM UTC-8, Jörg Prante wrote:

Hm, just wanted to clarify that I'm not familiar with the effects of "index.shard.check_on_startup: fix" on the latest ES with Lucene 4.

Even if I could test it, there is no guarantee that it would work for you. Different systems, different indices, different corruptions... who knows.

I'm quite puzzled: you don't have a replica shard? "CheckIndex" is really a last resort if there are no replicas, and it is not the preferred method of ensuring data integrity in ES...

Jörg
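
To make both options concrete, a rough sketch (node stopped for the manual run; the Lucene jar version, install path, and shard path below are only examples, and CheckIndex may need extra jars on the classpath for the codecs ES uses):

# the setting Jörg mentions, e.g. as a node-level default in elasticsearch.yml:
#   index.shard.check_on_startup: fix
# running Lucene's CheckIndex by hand against one shard, report-only first:
java -cp /usr/share/elasticsearch/lib/lucene-core-4.6.0.jar \
  org.apache.lucene.index.CheckIndex \
  "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index"
# -fix rewrites the index and drops any segments (and their documents) it cannot read:
java -cp /usr/share/elasticsearch/lib/lucene-core-4.6.0.jar \
  org.apache.lucene.index.CheckIndex \
  "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index" -fix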


Okay, a combination of CheckIndex -fix, careful manual allocation of shard
0, and restarts to clear the lock files has resulted in a green cluster.
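
For anyone searching later, the manual allocation bit looks roughly like this (0.90-era reroute syntax; the node name is a placeholder, and allow_primary means accepting whatever data is missing from that copy):

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "allocate": { "index": "zapier_legacy", "shard": 0, "node": "SomeNodeName", "allow_primary": true } }
  ]
}'
# then watch it go yellow/green:
curl 'localhost:9200/_cluster/health?pretty'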


Hey,

Great that you got it running again. The replica corruption thing makes sense, btw.
Do you still have a stack trace of the OOM exception you found first? I'd like to see what caused it, and maybe what one can do about it in the future, if there is more information.
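
(If you still have the logs around, something like this usually pulls the trace out; the log path is just the package default:)

grep -n -A 40 'OutOfMemoryError' /var/log/elasticsearch/*.log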

--Alex


Sure, I'm happy to bundle up all the logs and ship them to you guys. Assuming a zip attached to an email is fine?

We think the OOM caused the corruption, which later led to write.lock file handles being left open on the ES process when it hit EOF errors (which seemed like a bug, but I'm not very versed in ES failure scenarios), so the IndexWriter lock error was a bit of a red herring until we found the underlying corruption. In fact, it would often accept writes/queries for a while, I assume until it tried to read the broken segment and failed (perhaps while trying to promote a different, also broken, shard copy to primary).

In hindsight, simply fixing the Lucene segments and restarting the entire cluster (to clear the file handles) would have done the trick, but since this was production we wanted to do it one node at a time.
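
For completeness, the node-at-a-time variant looks roughly like this (0.90-era setting; newer versions use cluster.routing.allocation.enable instead):

# keep the cluster from shuffling shards around while a node is down:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": true }
}'
# restart the node, wait for it to rejoin, then re-enable allocation and wait for green:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": false }
}'
curl 'localhost:9200/_cluster/health?pretty'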
