Updating index gives "number of documents in the index cannot exceed 2147483519"

We are trying to write 42522080 documents into Elasticsearch in Python using Hail:

    # read the VCF into a Hail MatrixTable
    mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
    # split multi-allelic variants, keeping the original locus and alleles
    mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

    ...

    variant_count = mt.count_rows()
    logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))

    # export one flattened row (one variant) per Elasticsearch document
    row_table = mt.rows().flatten()
    row_table = row_table.drop(row_table.locus, row_table.alleles)

    hl.export_elasticsearch(row_table, ...)

And we are getting the error:

hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…

We are not sure why we hit Lucene's capacity when the actual number of documents we are trying to create is only 42522080, which we verified (see the variant_count line above). Does the limit somehow apply to inner fields as well? An index with 2 fewer fields per document was created successfully for us. How can we avoid this issue, and what could we try?

(Also asked on the Hail forum: "Updating index gives number of documents in the index cannot exceed 2147483519" - Hail Query & hailctl - Hail Discussion)

If you have nested documents, each nested object is stored as a separate document in Lucene behind the scenes, so a single indexed document can result in multiple Lucene documents. Note that the limit is per shard, not per index, so you can get around it by increasing the number of primary shards.
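
For example, the shard count is fixed when an index is created, so one option is to pre-create the index with more primary shards before running hl.export_elasticsearch. Below is a minimal sketch assuming the official elasticsearch Python client; the host, index name and shard count are placeholders, not values from the original post:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder host

    # The ~2.1 billion Lucene document limit applies per shard, so spreading the
    # (nested-expanded) documents across more primary shards raises the effective
    # ceiling for the index as a whole.
    es.indices.create(
        index="variants",  # placeholder index name
        body={
            "settings": {
                "number_of_shards": 12,   # placeholder; size from your own estimate
                "number_of_replicas": 0,  # replicas can be added after the bulk load
            }
        },
    )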


Is there a way to determine how many primary shards are needed for a total number of documents?

What is the average number of nested documents per indexed document?

I see 3 nested fields there: one has 7 array fields, another has 26, and the third has 5. Each array field can have many values. I suppose the answer is 3 then? But (3 + 1) * 42522080 is below the upper limit.

If a nested field contains an array with 7 JSON objects, that is 7 internal documents in addition to the main one. Your document could therefore require 1+7+26+5=39 internal documents. If you have multiple levels of nesting this naturally increases.


I think I got it. Basically, all 3 of my nested fields are arrays of objects that contain the numbers of fields mentioned above. So it seems I just need to know the length of each of the 3 arrays, not the number of fields in the objects they contain. If that is true, then I can see why this happens: the arrays can have hundreds of elements.

Yes, that sounds correct.
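
To answer the earlier question about sizing: a rough back-of-the-envelope estimate of the required number of primary shards can be made from the average array lengths. A small sketch, where the average lengths are hypothetical placeholders that should be replaced with values measured from the actual data:

    import math

    LUCENE_MAX_DOCS_PER_SHARD = 2_147_483_519

    indexed_docs = 42_522_080

    # Hypothetical average number of elements in each of the three nested arrays
    # per variant; measure the real averages from the data.
    avg_nested_array_lengths = [300, 150, 50]

    # Each indexed document becomes 1 root Lucene document plus one hidden Lucene
    # document per element of each nested array.
    lucene_docs_per_doc = 1 + sum(avg_nested_array_lengths)
    total_lucene_docs = indexed_docs * lucene_docs_per_doc

    # Minimum primary shards to stay under the per-shard limit; in practice leave
    # generous headroom rather than filling shards up to the cap.
    min_primary_shards = math.ceil(total_lucene_docs / LUCENE_MAX_DOCS_PER_SHARD)

    print(lucene_docs_per_doc, total_lucene_docs, min_primary_shards)

With hundreds of elements per array, the total can easily exceed the per-shard limit, which matches the error above.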
