Indexing hangs and stuck allocations in 0.19.3 - sound familiar at all?

Hi All,
We've been running 0.19.3 happily on a few production clusters for at least
a year with no problems; we haven't even needed to restart anything. We are
in the process of upgrading to the latest 0.90, and this is just a feeler to
see whether anybody else has hit a similar issue so we can do a proper
postmortem.

Here is basically the chain of events:

  • Cleaned up an old index and swapped in a new one, which also bumped the
    replica count from 0 to 1. Just mentioning it because it is the only thing
    we did remotely close to when things went wrong.
  • 9 hours later an index began hanging requests to index new docs.
  • This caused our index queues to get backed up and some monitoring alarms
    to start going off, so we were aware of the issue.
  • The cluster state was green, and we did the following to try to resolve it:
  • Restarted our indexer application, which got things going again for a few
    minutes, but things got gummed up again shortly after.
  • Set replicas down to 0 and then back to 1 for the suspect indexes
  • The new replicas couldn't recover and were stuck in initializing, so the
    cluster was in a yellow state. This was interesting.
  • Opened and closed the indexes that could potentially be the problem. Made
    no difference.
  • Increased concurrent recoveries (from 1 to 5). This got me down to 9
    shards stuck in init. (The settings calls for this and the replica flip
    are sketched just after this list.)
  • I tried creating a new index to rebuild some content I suspected was
    corrupt; that new index pushed the cluster state to red and was stuck
    trying to init.
  • At this point, I decided it was best to restart the cluster. Things came
    up clean and I don't believe there was any data corruption.
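
For concreteness, the replica flip and the recovery bump were along the lines
of the following index-settings and cluster-settings calls (a rough Python
sketch, not our actual tooling; the localhost URL and the index name are
placeholders, and I'm assuming node_concurrent_recoveries is dynamically
updatable on your version):

import json
import urllib.request

ES = "http://localhost:9200"  # placeholder; point this at a node in the cluster

def es_put(path, body):
    """PUT a JSON body to the Elasticsearch REST API and return the parsed response."""
    req = urllib.request.Request(
        ES + path,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Flip replicas to 0 and back to 1 on a suspect index ("suspect_index" is a placeholder).
es_put("/suspect_index/_settings", {"index": {"number_of_replicas": 0}})
es_put("/suspect_index/_settings", {"index": {"number_of_replicas": 1}})

# Bump per-node concurrent recoveries (transient, so it does not survive a full restart).
es_put("/_cluster/settings", {
    "transient": {"cluster.routing.allocation.node_concurrent_recoveries": 5}
})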

Does this sound familiar to anyone?

Here are a few bugs that I think could be related:



Many thanks for taking the time to read.

Best Regards,
Paul


I also had uninitialized shards in 0.19, and they went away after a full
restart. Is there anything in the log files? Maybe file descriptor usage was
exceeded, or there was some other unexpected resource shortage? I went to
0.19.11 after that, but I can't tell whether it helped or not.

Jörg


The elasticsearch logs are silent about the issue. File descriptors are fine,
although I have seen what you're describing in our acceptance env on 0.19.3,
where file descriptor usage just starts climbing. We default to 128K open and
start alarming at 64K.
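
For reference, the per-node open file descriptor count is visible through the
nodes stats API; here is a rough sketch of pulling it (localhost is a
placeholder, and the flag-style ?process=true parameter is what I'd expect
the 0.90-era API to take):

import json
import urllib.request

ES = "http://localhost:9200"  # placeholder; any node will do

# Ask for the process section of node stats (flag-style parameters on 0.90-era versions).
with urllib.request.urlopen(ES + "/_nodes/stats?process=true") as resp:
    stats = json.loads(resp.read())

for node_id, node in stats["nodes"].items():
    fds = node.get("process", {}).get("open_file_descriptors")
    print(node.get("name", node_id), "open file descriptors:", fds)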

RAM, disk, and CPU all look fine. Documents coming in from a different queue
kept flowing without issue; it was one specific index that went AWOL on the
indexing side.
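
If anyone is chasing something similar, index-level cluster health is one way
to see which index the stuck or unhappy shards belong to (again just a
sketch; localhost is a placeholder):

import json
import urllib.request

ES = "http://localhost:9200"  # placeholder

# level=indices breaks the health report down per index, so a yellow/red index
# with initializing shards stands out from the rest.
with urllib.request.urlopen(ES + "/_cluster/health?level=indices") as resp:
    health = json.loads(resp.read())

for name, idx in health["indices"].items():
    if idx["status"] != "green" or idx["initializing_shards"] > 0:
        print(name, idx["status"], "initializing:", idx["initializing_shards"])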

Thanks!
Paul

