Shards fail to start up when they have > 2 billion documents

Sometimes shards fail to start and complain about having too many documents; each affected shard holds more than 2 billion documents:

[WARN ][cluster.action.shard ] [node] [index][shard] received shard failed for [index][shard], node[node_id], [R], s[STARTED], indexUUID [index_id], reason [engine failure, message [refresh failed][IllegalArgumentException[Too many documents, composite IndexReaders cannot exceed 2147483519]]]

This is due to the shard exceeding Lucene's limit on the maximum number of documents per index.

Lucene imposes a hard limit on the number of documents a single index can hold. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128). Earlier versions of Lucene did not enforce this limit, so ingesting enough documents to exceed it could leave the index unusable, and the Elasticsearch shard would then fail to start.
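For concreteness, the number comes straight from Java's Integer.MAX_VALUE; a quick Python check:

    # Integer.MAX_VALUE in Java is 2^31 - 1 = 2,147,483,647
    INTEGER_MAX_VALUE = 2**31 - 1
    LUCENE_MAX_DOCS = INTEGER_MAX_VALUE - 128
    print(LUCENE_MAX_DOCS)  # 2147483519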

SOLUTION

Lucene has since added enforcement checks to prevent the index from being corrupted when it exceeds the max # of documents limit:

https://issues.apache.org/jira/browse/LUCENE-5843 (available in ES 1.4 and above)
https://issues.apache.org/jira/browse/LUCENE-6299 (will be available in the next release after 1.4.4)

Keep in mind that even with the Lucene changes above, if you do hit the limit, you will not be able to add more documents to the index (but will still be able to search it).

Until you are running a version with both fixes above, it is helpful to monitor per-shard document counts with the _cat/shards API so that shards do not exceed the Lucene limit and become unusable; a monitoring sketch is shown below.
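For example, a small check along these lines (a sketch only; it assumes the cluster is reachable on localhost:9200, uses the plain-text _cat/shards output with the h= column selection, and the 80% warning threshold is an arbitrary choice):

    # Sketch: flag shards whose document count is approaching the Lucene limit.
    import urllib.request

    LUCENE_MAX_DOCS = 2_147_483_519          # Integer.MAX_VALUE - 128 (LUCENE-5843)
    WARN_THRESHOLD = int(LUCENE_MAX_DOCS * 0.8)

    # _cat/shards with h= returns whitespace-separated columns, no header row.
    url = "http://localhost:9200/_cat/shards?h=index,shard,prirep,state,docs"
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            cols = line.split()
            if len(cols) < 5 or not cols[4].isdigit():
                continue                      # skip unassigned shards with no doc count
            index, shard, prirep, state, docs = cols[0], cols[1], cols[2], cols[3], int(cols[4])
            # Note: this is the live document count; deleted-but-unmerged documents
            # also count toward the Lucene limit, so treat this as a lower bound.
            if docs > WARN_THRESHOLD:
                print(f"{index} shard {shard} ({prirep}) has {docs} docs "
                      f"({docs / LUCENE_MAX_DOCS:.0%} of the Lucene limit)")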

If you see that a shard is getting close to the limit, you can update the index settings to block all writes to the index (see the index.blocks.write setting), then either create a new index and start writing to it, or reindex into a new index with a larger number of shards. A rough example follows.
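Blocking writes and cutting over might look roughly like this (a sketch only; "old_index", "new_index", and the shard count of 24 are hypothetical, and the host is assumed to be localhost:9200):

    # Sketch: block writes on the full index and create a new one to write to.
    import json
    import urllib.request

    ES = "http://localhost:9200"

    def put(path, body):
        req = urllib.request.Request(
            ES + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    # 1. Stop all writes to the index that is near the limit.
    print(put("/old_index/_settings", {"index.blocks.write": True}))

    # 2. Create a replacement index with more primary shards, then point writers at it.
    print(put("/new_index", {"settings": {"number_of_shards": 24}}))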


I have Elasticsearch 1.4.4.
I have a shard that has hit this limit, and it keeps failing to recover.
I moved/deleted the translogs for this shard, but it still fails with the message
"Too many documents, composite IndexReaders cannot exceed 2147483519" in the log.
There are far fewer documents in this shard than the limit, but I think it is counting deleted documents too.
I have never run any kind of optimize command on the index, and unfortunately I have no replicas for it. I am still getting to know Elasticsearch, and given the size of the index (12 shards) and how much memory it uses for parent/child relationships, I was not willing to try replicas, because from what I was told and researched, adding a replica basically increases memory usage by the same amount as a single primary.

I am trying to recover the shard using the Lucene index recovery method from this page: http://www.jillesvangurp.com/2015/02/18/elasticsearch-failed-shard-recovery/
When I run the command without -fix at the end, it tells me that I'll lose all of the documents.

At the beginning of my project I didn't know how many documents I would have, but I knew it would be a lot. I want to offload many of the documents onto a different index and trim the index down by deleting unnecessary documents.
But for now I need to get the failed shard back so I can do this.

How can I get the number of documents under the limit so the shard can at least come up, letting me delete documents and start recovering and moving data?

Can I use some kind of Lucene command to delete a few of the most recent documents?
Or could I upgrade to a newer Elasticsearch version that would allow the shard to come up?

Thanks,