ES 0.90.0 / 0.90.1 leaking open file handles?

Hi everyone,

We've been seeing an open file handle issue on our busiest index (the
hopper index). Elasticsearch seems to keep the handles of deleted Lucene
files open, and once they pass a certain threshold the JVM appears to just
stop responding. I don't know enough about ES yet to debug this properly,
so I was wondering what I should look for. I'll describe our current setup
along with the troubleshooting I've done so far.

Our config:

  • 6-node cluster with:

    • 3 nodes data=true, master=true
    • 3 nodes data=false, master=false
  • 2 indices on that cluster.

I've attached a sample config file for the 3 data nodes and the cluster
health output from after one of the nodes failed.
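
In case the attachment doesn't make it through, the node-role settings in
our elasticsearch.yml look roughly like this (trimmed down to just those
lines):

    # on the 3 data/master-eligible nodes
    node.master: true
    node.data: true

    # on the 3 client-only nodes
    node.master: false
    node.data: false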

Troubleshooting:

  • I'd like to paste a curl recreation of the issue, but I'm not sure
    exactly what triggers it. My current suspicion is segment merging (I'm
    planning to watch the merge/segment stats with something like the calls
    sketched after this list).

  • As far as I understand our usage, we talk to the cluster mainly through
    the transport client and the REST API to perform actions on the index.

  • With our current load, it takes about 24h to blow up. The usual symptom
    is search connections starting to time out until the node dies
    completely.

  • I'm attaching a list of the open handles that point to deleted files;
    there are several thousand of them (gathered with something like the
    lsof invocation sketched after this list).

  • We've had the issue with 0.90.0.RC2, 0.90.0 and 0.90.1.

  • I've found this thread, which looks similar to what I'm seeing but
    isn't exactly my problem:

https://groups.google.com/forum/#!msg/elasticsearch/dVY_BVfHUkw/i8m5o39KxKUJ

  • I've found the following past issues which I don't think are related:

  • I've looked a bit at the heap dump with jhat (taken roughly with the
    jmap/jhat commands sketched after this list), but I don't know the code
    or these tools well enough to analyze the dump properly. It's currently
    about 11GB in size. If someone wants it, I'll be more than happy to
    provide it in some way.

  • While looking at the heap dump, I saw a lot of FileDescriptor objects
    belonging to "Lucene41PostingsReader" and a lot belonging to unresolved
    parents (those last ones, I think, would have been collected on the next
    GC run).

  • On other clusters, we see the number of open handles to deleted files
    going up and down, but for some unknown reason, on this index (the
    hopper index), it fluctuates but mostly keeps going up.
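
Regarding the segment-merge suspicion above, what I was planning to look at
is roughly this (our index is called "hopper"; I'm not sure these are the
right stats to watch, so corrections welcome):

    # list the Lucene segments of the suspect index
    curl -s 'localhost:9200/hopper/_segments?pretty'

    # merge / refresh / flush stats for the same index
    curl -s 'localhost:9200/hopper/_stats?merge=true&refresh=true&flush=true&pretty'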
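
For reference, the list of open handles on deleted files was gathered with
something along these lines (<ES_PID> stands for the Elasticsearch java
process; the exact invocation may have differed slightly):

    # file descriptors held by the ES process that point to deleted files
    lsof -p <ES_PID> | grep -i deleted

    # quick count, to watch the trend over time
    lsof -p <ES_PID> | grep -ci deleted

    # what ES itself reports (I believe the process flag is needed on 0.90
    # to get open_file_descriptors in the output)
    curl -s 'localhost:9200/_nodes/stats?process=true&pretty'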
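
The heap dump itself was taken and browsed roughly like this (plain JDK
tools, nothing ES-specific; the heap given to jhat is a guess for an 11GB
dump):

    # dump the heap of the ES JVM (same <ES_PID> as above)
    jmap -dump:format=b,file=es-heap.hprof <ES_PID>

    # browse the dump over HTTP on port 7000; jhat needs a lot of memory
    jhat -J-Xmx16g es-heap.hprof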

Any ideas what we should look for and how we can troubleshoot this further?

Thanks,

Sebastien Coutu
