So, a little more digging, and it looks like the process was holding onto a
write.lock file that had already been deleted.
sudo lsof -uelasticsearch | grep 'legacy/0'
java 27517 elasticsearch 1042uW REG 202,1 0 525279 /var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/write.lock (deleted)
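For reference, another way to confirm that a process is still holding handles
to deleted files on Linux (using the PID from the lsof output above) is to
look at its file-descriptor table directly:

ls -l /proc/27517/fd | grep '(deleted)'

Symlink targets marked (deleted) are files the JVM still has open; the handle
only goes away when the process closes it or exits, which is why restarting
the node clears it.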
We did delete some leftover lock files after the nodes powered down, but
that seems like it shouldn't have caused this (unless we made a mistake and
nuked one on a live instance). Somehow that, plus the OOM corruption, led to
a pretty crazy situation. We're almost recovered after some restarts, and we
should be able to put together a blog post on the situation afterwards. I'll
follow up with results and a link ASAP.
On Tuesday, December 17, 2013 8:13:39 PM UTC-8, Bryan Helmig wrote:
We're also fine with losing a few docs, as we can reindex them from
another source, so dropping the documents works for us.
On Tuesday, December 17, 2013 7:47:21 PM UTC-8, Bryan Helmig wrote:
All replicas have the same corruption, it seems. We can't get a primary
up for shard 0, therefore the replica never comes up. Does that make sense?
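A quick way to see which shards are stuck unassigned is the cluster health
API at shard level (host and port here are just the defaults):

curl -s 'http://localhost:9200/_cluster/health?level=shards&pretty'

A primary that cannot be assigned keeps its replicas unassigned as well,
which matches the behaviour described above.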
On Tuesday, December 17, 2013 6:33:19 PM UTC-8, Jörg Prante wrote:
Hm, just wanted to clarify that I'm not familiar with the effects of
"index.shard.check_on_startup: fix" on the latest ES with Lucene 4.
Even if I could test it, there is no guarantee that it would work for you.
Different systems, different indices, different corruptions... who knows.
I'm quite puzzled: you don't have a replica shard? CheckIndex is really a
last resort when there are no replicas, and it is not the preferred way to
ensure data integrity in ES...
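For reference, that setting goes in elasticsearch.yml (or in the index
settings); exactly what "fix" does depends on the ES/Lucene version, so
treat this as a sketch rather than a recommendation:

# run a CheckIndex pass on each shard when it opens at startup;
# "fix" removes segments that fail the check and can therefore lose
# documents, while "checksum" and "true" only verify and fail the shard
index.shard.check_on_startup: fix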
Okay, a combination of CheckIndex -fix, careful manual allocation of shard
0, and restarts to clear the lock files has resulted in a green cluster.
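For anyone who hits the same thing, a rough sketch of those two steps; the
lucene-core jar path and the node name below are placeholders, and the shard
path is the one from the lsof output earlier:

# 1. With ES stopped on that node, run Lucene CheckIndex against the bad
#    shard's index directory; -fix drops any segments it cannot read.
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  '/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index' -fix

# 2. Force-allocate shard 0 onto a node via the reroute API. allow_primary
#    lets ES bring the shard up as a primary even though no good copy is
#    assigned; the docs warn this can lose data, so only do it after
#    CheckIndex has already decided what survives.
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [
    { "allocate": { "index": "zapier_legacy", "shard": 0,
                    "node": "node-1", "allow_primary": true } }
  ]
}'

The exact reroute commands are version-dependent, so check the docs for your
release before copying this.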
Hey,
great, you got it running again. The replica corruption thing makes sense,
btw.
Do you still have a stack trace of the OOM exception you found first? Would
like to see what caused it and maybe what one can do about it in the
future, if there is more information.
--Alex
Sure, I'm happy to bundle up all logs and ship them to you guys. Assuming
zipped in an email is fine?
We think the OOM caused the corruption, which later led to write.lock file
handles being left open on the ES process when it hit EOF errors (which
seemed like a bug, but I'm not very versed in ES failure scenarios), so the
IndexLock was a bit of a red herring until we found the underlying
corruption. In fact, it would often accept writes/queries for a while, I
assume until it tried to read the broken segment and broke (perhaps while
trying to promote a different, also broken, shard to primary).
In hindsight, simply fixing the Lucene segments and restarting the entire
cluster (to clear file handles) would have done the trick, but since this
was production we wanted to do it one node at a time.
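A rough sketch of a node-at-a-time rolling restart, assuming an 0.90-era
cluster (the allocation setting was renamed to cluster.routing.allocation.enable
in later releases) and the stock init script; hosts and timeouts are
placeholders, not necessarily the exact steps used here:

# stop shard shuffling while a node is down
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": true }
}'

# restart the node; this is what actually releases the stale write.lock
# file handles held by the old JVM process
sudo service elasticsearch restart

# re-enable allocation and wait for green before touching the next node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": false }
}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'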