Unable to load an index with more than 2.1 B documents

I stopped the Elasticsearch service and restarted it. Now one of the indices cannot be loaded. Here's the status of the indices:

    yellow open test-index        _taPEhm3ReSwCYjb1Y5hQA 1 1         3        0  7.5kb  7.5kb
    yellow open wiki-index-000001 aaZvlpgJSuO43uMGKuyqKw 1 1         0        0   208b   208b
    yellow open wiki-index-unique s1tU2HpnStWNobkZjQ7KQA 1 1 118098273 51827014 36.8gb 36.8gb
    yellow open corpora-index     ruArLOJoSv6HkKVBv-o1MA 1 1 289045137        0 47.3gb 47.3gb
    red    open corpora           -86nIPPwS8K5IOpFYNcXBQ 1 1                                 
    yellow open simple_bulk       6N8NCLd5S5qOKf9o6R6YhA 1 1         6        0  9.6kb  9.6kb
    yellow open test1             SN8ViALMRNGHkBF7o8-3zw 1 1         2        1  5.3kb  5.3kb
    yellow open simple-index      zYctGNhNRGWrnCOYpKHBcQ 1 1         1        0  4.5kb  4.5kb

When I look at the log file of the cluster, I find this error:

    [2020-11-20T23:32:30,045][INFO ][o.e.i.s.IndexShard       ] [ilcompn0] [corpora][0] ignoring recovery of a corrupt translog entry
    java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
            at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:211) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8dd$
            at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:232) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8$
            at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:419) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e -$
            at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1333) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ivera $
            at org.apache.lucene.index.IndexWriter.softUpdateDocument(IndexWriter.java:1661) ~[lucene-core-8.6.2.jar:8.6.2 016993b65e393b58246d54e8ddda9f56a453eb0e - ive$
            at org.elasticsearch.index.engine.InternalEngine.updateDocs(InternalEngine.java:1260) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1091) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:935) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:819) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:791) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1526) ~[elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1557) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$9(IndexShard.java:1605) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:488) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:463) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:125) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1610) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:436) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:98) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:96) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1883) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710) [elasticsearch-7.9.1.jar:7.9.1]
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.9.1.jar:7.9.1]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
            at java.lang.Thread.run(Thread.java:832) [?:?]

Is there any suggestion on how I can rescue the index?

Welcome!

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

I don't think you can do anything but reindex.

Thank you! Here's the output of GET /:

{
  "name" : "myserver",
  "cluster_name" : "mycluster",
  "cluster_uuid" : "3MYZpmHQSuag-x1OmbIq2Q",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "083627f112ba94dffc1232e8b42b73492789ef91",
    "build_date" : "2020-09-01T21:22:21.964974Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The output of GET /_cat/nodes?v:

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
12.1.25.31           36          57   1    1.12    1.14     1.20 dilmrt    *      myserver

The output of GET /_cat/health?v:

epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1605949261 09:01:01   mycluster red             1         1      7   7    0    0        9             0                  -                 43.8%

The output of GET /_cat/indices?v:

health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test-index        _taPEhm3ReSwCYjb1Y5hQA   1   1          3            0      7.5kb          7.5kb
yellow open   wiki-index-000001 aaZvlpgJSuO43uMGKuyqKw   1   1          0            0       208b           208b
yellow open   wiki-index-unique s1tU2HpnStWNobkZjQ7KQA   1   1  118098273     51827014     36.8gb         36.8gb
yellow open   corpora-index     ruArLOJoSv6HkKVBv-o1MA   1   1  289045137            0     47.3gb         47.3gb
yellow open   simple_bulk       6N8NCLd5S5qOKf9o6R6YhA   1   1          6            0      9.6kb          9.6kb
red    open   corpora           -86nIPPwS8K5IOpFYNcXBQ   1   1                                                  
yellow open   test1             SN8ViALMRNGHkBF7o8-3zw   1   1          2            1      5.3kb          5.3kb
yellow open   simple-index      zYctGNhNRGWrnCOYpKHBcQ   1   1          1            0      4.5kb          4.5kb

The limit of roughly two billion documents (2147483519, per the exception in your log) applies per shard, not per index, so you might try the split index API to increase the number of primary shards and thereby the capacity. If this is not an option or does not work, e.g. if you are on an older version of Elasticsearch, you may need to reindex into an index with a larger number of primary shards.
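To make the per-shard limit concrete: the exception in the log earlier in the thread gives a ceiling of 2147483519 documents, so the smallest workable number of primary shards for a given document count is a ceiling division. This is only a rough sketch. The helper name `min_primary_shards` and the example document count are made up for illustration, and it assumes documents are spread evenly across shards, so in practice you would want generous headroom rather than the bare minimum.

```shell
# Per-shard document ceiling, copied from the exception message in the log.
LUCENE_MAX_DOCS=2147483519

# Hypothetical helper: smallest shard count that keeps each shard under the
# ceiling, assuming an even spread of documents across shards.
min_primary_shards() {
  local total_docs=$1
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (total_docs + LUCENE_MAX_DOCS - 1) / LUCENE_MAX_DOCS ))
}

# Illustrative count only, not taken from this cluster.
min_primary_shards 5000000000   # prints 3
```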

I split the index by first running this command:

curl -X PUT "localhost:9200/corpora/_block/write?pretty"

With the following output:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "indices" : [
    {
      "name" : "corpora",
      "blocked" : true
    }
  ]
}

then running this command:

curl -X POST "localhost:9200/corpora/_split/corpora-split?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 5
  }
}
'

Here's the output of it:

{
  "acknowledged" : true,
  "shards_acknowledged" : false,
  "index" : "corpora-split"
}

And this is the output of GET /_cat/indices?v:

yellow open   corpora2          4XF6lD5vSRSHJ6Jub3q0Yg   5   1          0            0        1kb            1kb
yellow open   test-index        _taPEhm3ReSwCYjb1Y5hQA   1   1          3            0      7.5kb          7.5kb
yellow open   wiki-index-000001 aaZvlpgJSuO43uMGKuyqKw   1   1          0            0       208b           208b
yellow open   wiki-index-unique s1tU2HpnStWNobkZjQ7KQA   1   1  118098273     51827014     36.8gb         36.8gb
yellow open   corpora-index     ruArLOJoSv6HkKVBv-o1MA   1   1  289045137            0     47.3gb         47.3gb
yellow open   simple_bulk       6N8NCLd5S5qOKf9o6R6YhA   1   1          6            0      9.6kb          9.6kb
red    open   corpora           -86nIPPwS8K5IOpFYNcXBQ   1   1                                                  
red    open   corpora-split     EH_mLnZ_QQaD9JQ9jADAaA   5   1                                                  
yellow open   test1             SN8ViALMRNGHkBF7o8-3zw   1   1          2            1      5.3kb          5.3kb
yellow open   simple-index      zYctGNhNRGWrnCOYpKHBcQ   1   1          1            0      4.5kb          4.5kb

Both "corpora" (i.e., old index with 1 shard) and "corpora-split" (i.e., new index with 5 shards) are not loaded properly.

Yes, you can't split the index until it's in at least yellow health.
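A quick way to check that precondition before attempting a split is to ask the cat indices API for just the health column. The sketch below assumes Elasticsearch is reachable at localhost:9200 without authentication, as in the commands earlier in the thread; the helper name `index_is_splittable` is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the index reports green or yellow.
# Assumes an unauthenticated node on localhost:9200.
index_is_splittable() {
  local index=$1 health
  # h=health limits the cat output to the health column for this index.
  health=$(curl -s "localhost:9200/_cat/indices/${index}?h=health" | tr -d '[:space:]')
  [ "$health" = "green" ] || [ "$health" = "yellow" ]
}

# Usage:
#   index_is_splittable corpora && echo "ok to split" || echo "recover the index first"
```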

I think the only (edit: see below) way forward here is to shut the node down and run bin/elasticsearch-shard remove-corrupted-data --truncate-clean-translog on this shard (docs) since the failure is occurring when replaying operations from the translog. This will lose any recent operations. The shard should then recover when you restart the node.

edit: a better way forward would probably be to restore a recent version from a snapshot, then split that.

That is probably due to it being in a red state. Not sure if there is any way to recover that.

Does Elasticsearch 7.9.1 take any automatic snapshots by default? I have not taken one manually.

If you are running it as a managed service on Elastic Cloud, snapshots are taken automatically, but this is not the case if you are hosting it yourself. Snapshots have to be set up and configured, as they require some form of shared storage across the cluster, e.g. a shared filesystem, S3, HDFS etc.
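For a self-hosted, single-node setup like this one, a shared-filesystem repository is probably the simplest option. The following is a sketch rather than a tested recipe: the repository name, snapshot name and location are placeholders, the helper names are made up, and the location has to be listed under path.repo in elasticsearch.yml before the repository can be registered.

```shell
# Assumes an unauthenticated node on localhost:9200.
ES_URL="localhost:9200"

# Register a shared-filesystem ("fs") snapshot repository.
# The location must be listed under path.repo in elasticsearch.yml.
register_fs_repo() {
  local repo=$1 location=$2
  curl -s -X PUT "${ES_URL}/_snapshot/${repo}?pretty" \
    -H 'Content-Type: application/json' \
    -d "{\"type\": \"fs\", \"settings\": {\"location\": \"${location}\"}}"
}

# Snapshot a single index and wait for the snapshot to finish.
snapshot_index() {
  local repo=$1 snapshot=$2 index=$3
  curl -s -X PUT "${ES_URL}/_snapshot/${repo}/${snapshot}?wait_for_completion=true&pretty" \
    -H 'Content-Type: application/json' \
    -d "{\"indices\": \"${index}\"}"
}

# Usage (placeholder names and path):
#   register_fs_repo my_backup /mnt/backups/my_backup
#   snapshot_index my_backup snapshot_1 corpora-index
```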

To remove the corrupted data, I first stopped the Elasticsearch service by running the following on the server hosting it:

sudo systemctl stop elasticsearch.service

Then I tried to remove the corrupted data by running this:

sudo /usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data --index corpora --shard-id 0

However, it returns this output:

ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
------------------------------------------------------------------------

    WARNING: Elasticsearch MUST be stopped before running this tool.

Exception in thread "main" java.io.IOException: failed to obtain lock on /mnt/myserver/user/pouranbe/corpora/extract/elasticsearch/data/nodes/0
	at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:223)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processNodePaths(ElasticsearchNodeCommand.java:152)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:176)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:91)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:35)
Caused by: java.nio.file.AccessDeniedException: /mnt/myserver/user/pouranbe/corpora/extract/elasticsearch/data/nodes/0/node.lock
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:125)
	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
	at org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:216)
	... 8 more

The owner of the directory /mnt/myserver/user/pouranbe/corpora/extract/elasticsearch/data is elasticsearch:

drwxr-xr-x 3 elasticsearch elasticsearch      4096 Sep 20 21:38 data

The user that I use to run these commands has root access. Is there anything I'm missing here?

You should run them as the elasticsearch user.
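Putting the suggestions in this thread together, the sequence might look like the sketch below. It assumes the deb install path shown above, that sudo is allowed to switch to the elasticsearch user, and it includes the --truncate-clean-translog option recommended earlier. The `run` and `repair_shard` helpers are made up for illustration and only print the commands unless DRY_RUN=0 is set, since truncating the translog discards any operations that were not yet committed.

```shell
ES_HOME=/usr/share/elasticsearch

# Print each command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: stop the node, repair one shard as the
# elasticsearch user, then start the node again.
repair_shard() {
  local index=$1 shard=$2
  run sudo systemctl stop elasticsearch.service
  run sudo -u elasticsearch "${ES_HOME}/bin/elasticsearch-shard" \
    remove-corrupted-data --index "$index" --shard-id "$shard" \
    --truncate-clean-translog
  run sudo systemctl start elasticsearch.service
}

# Review the commands first, then rerun with DRY_RUN=0 to execute them:
#   repair_shard corpora 0
#   DRY_RUN=0 repair_shard corpora 0
```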