CorruptIndexException: possibly transient resource issue, or a Lucene or JVM bug

(David Roberts) #1

We have seen an issue with Elasticsearch 2.1.2 where the .kibana index somehow got into a corrupt state and a sequence of warning messages was repeatedly logged to the Elasticsearch log (see the end of this post).

One of the exceptions in the stack trace contains the text "possibly transient resource issue, or a Lucene or JVM bug".

According to this Lucene users thread, the cause of this message was fixed in LUCENE-6970. That fix went into Lucene 5.4.1, which, according to #16160, is included in Elasticsearch 2.2.

So, my questions are:

  1. Is the problem fixed by LUCENE-6970 the only thing that could cause the message "possibly transient resource issue, or a Lucene or JVM bug", or could the index corruption we observed be something completely different?
  2. Is it a concern that Elasticsearch logged the same sequence of messages over and over until we killed it? It logged the lines at the end of this post several times per second for hours, producing a 6 GB log file by the time we killed it. Should there be a point at which it stops trying to recover a failed shard after X retries?
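
To make question 2 concrete, here is a minimal sketch of the kind of behaviour I have in mind: a capped number of recovery attempts with a growing delay between them. This is not Elasticsearch code, and I am not claiming it works this way internally. The function name `recover_with_limit`, its parameters, and the default values are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable


def recover_with_limit(
    attempt_recovery: Callable[[], None],
    max_retries: int = 5,          # hypothetical default, the "X" in the question
    base_delay_s: float = 0.5,     # hypothetical initial wait between attempts
    max_delay_s: float = 30.0,     # hypothetical upper bound on the wait
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run a recovery action at most max_retries times, then give up.

    Returns True if an attempt succeeded, False if every attempt failed.
    """
    delay = base_delay_s
    for attempt in range(1, max_retries + 1):
        try:
            attempt_recovery()
            return True
        except Exception as exc:
            # One log line per attempt, rather than several per second forever.
            print(f"recovery attempt {attempt}/{max_retries} failed: {exc}")
            if attempt < max_retries:
                sleep(delay)
                # Double the wait each time, up to the cap.
                delay = min(delay * 2, max_delay_s)
    # Out of retries: stop here and leave the shard failed for an operator.
    return False
```

With something like this, a permanently corrupt shard would produce a handful of log lines and then go quiet, instead of filling the disk with the same stack trace.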

The sequence of log messages is as follows. As I said, this same sequence was logged several times per second until Elasticsearch was killed. (Sorry - this has to be a picture rather than text due to the limit on post length.)
