Updating index gives "number of documents in the index cannot exceed 2147483519"

We are trying to write 42522080 documents into Elasticsearch in Python using Hail:

    # read the VCF into a Hail MatrixTable
    mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
    # split multi-allelic variants, keeping the original locus and alleles
    mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

    ...

    variant_count = mt.count_rows()
    logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))

    # export one flattened row (one variant) per Elasticsearch document
    row_table = mt.rows().flatten()
    row_table = row_table.drop(row_table.locus, row_table.alleles)

    hl.export_elasticsearch(row_table, ...)

And we are getting the error:

hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…

We are not sure why we hit Lucene's capacity when the actual number of documents we are trying to create is only 42522080, which we verified (see the variant_count line above). Does the limit somehow apply to inner fields as well? An index with 2 fewer fields per document was created successfully for us. How can we avoid this issue, and what could we try?

(Also asked on the Hail forum: "Updating index gives number of documents in the index cannot exceed 2147483519" - Hail Query & hailctl - Hail Discussion)

If you have nested documents, each nested object is stored as a separate document in Lucene behind the scenes, so a single indexed document can result in multiple Lucene documents. Note that the limit is per shard, not per index, so you can get around it by increasing the number of primary shards.
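
For example, the shard count is fixed when an index is created, so one option is to pre-create the index with more primary shards before running hl.export_elasticsearch. Below is a minimal sketch assuming the official elasticsearch Python client; the host, index name and shard count are placeholders, not values from the original post:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder host

    # The ~2.1 billion Lucene document limit applies per shard, so spreading the
    # (nested-expanded) documents across more primary shards raises the effective
    # ceiling for the index as a whole.
    es.indices.create(
        index="variants",  # placeholder index name
        body={
            "settings": {
                "number_of_shards": 12,   # placeholder; size from your own estimate
                "number_of_replicas": 0,  # replicas can be added after the bulk load
            }
        },
    )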


Is there a way to determine how many primary shards are needed for a total number of documents?

What is the average number of nested documents per indexed document?

I see 3 nested fields there: one has 7 array fields, another has 26, and the third has 5. Each array field can have many values. I suppose the answer is 3 then? But (3 + 1) * 42522080 is below the upper limit.

If a nested field contains an array with 7 JSON objects, that is 7 internal documents in addition to the main one. Your document could therefore require 1+7+26+5=39 internal documents. If you have multiple levels of nesting this naturally increases.


I think I got it. Basically, all 3 of my nested fields are arrays of objects that contain the numbers of fields mentioned above. So it seems I just need to know the length of each of the 3 arrays, not the number of fields in the objects they contain. If that is true, then I can see why this happens: the arrays can have hundreds of elements.

Yes, that sounds correct.
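
To answer the earlier question about sizing: a rough back-of-the-envelope estimate of the required number of primary shards can be made from the average array lengths. A small sketch, where the average lengths are hypothetical placeholders that should be replaced with values measured from the actual data:

    import math

    LUCENE_MAX_DOCS_PER_SHARD = 2_147_483_519

    indexed_docs = 42_522_080

    # Hypothetical average number of elements in each of the three nested arrays
    # per variant; measure the real averages from the data.
    avg_nested_array_lengths = [300, 150, 50]

    # Each indexed document becomes 1 root Lucene document plus one hidden Lucene
    # document per element of each nested array.
    lucene_docs_per_doc = 1 + sum(avg_nested_array_lengths)
    total_lucene_docs = indexed_docs * lucene_docs_per_doc

    # Minimum primary shards to stay under the per-shard limit; in practice leave
    # generous headroom rather than filling shards up to the cap.
    min_primary_shards = math.ceil(total_lucene_docs / LUCENE_MAX_DOCS_PER_SHARD)

    print(lucene_docs_per_doc, total_lucene_docs, min_primary_shards)

With hundreds of elements per array, the total can easily exceed the per-shard limit, which matches the error above.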
