We are trying to write 42522080 documents into Elasticsearch in Python using Hail:
import hail as hl

# dataset_path, genome_version and logger are defined earlier in our pipeline.
mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)
...
variant_count = mt.count_rows()
logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))
# Flatten nested row fields into top-level fields and drop the key fields before export.
row_table = mt.rows().flatten()
row_table = row_table.drop(row_table.locus, row_table.alleles)
hl.export_elasticsearch(row_table, ...)
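As a sanity check, here is a minimal sketch of how the exported row count can be confirmed against variant_count (same mt and row_table as above; nothing beyond standard Hail calls, so this is illustrative rather than part of the failing pipeline):

```python
# Flattening and dropping fields should not change the number of rows,
# so the table we export should still have one row (document) per variant.
row_table.describe()                       # prints the flattened schema, i.e. the fields of each document
assert row_table.count() == variant_count  # both report 42522080 for this dataset
```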
When hl.export_elasticsearch runs, we get this error:
hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…
We are not sure why we are hitting Lucene's capacity when the actual number of documents we are trying to create is only 42522080, which we verified (see the variant_count line above). Does that mean the limit is somehow applied to inner fields as well? An index with 2 fewer fields per document was created successfully for us. How can we avoid this issue, and what could we try?
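One check we are considering (a rough sketch, assuming the official elasticsearch Python client, with a placeholder host and index name rather than our real configuration) is comparing the top-level document count with the document count reported by the cat indices API, since the latter is taken from Lucene and can include hidden nested documents:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host
index_name = "my_variants_index"             # placeholder index name

# Top-level (searchable) documents in the index.
top_level = es.count(index=index_name)["count"]

# Document count as reported via the cat indices API; according to the
# Elasticsearch docs this figure comes from Lucene and includes hidden
# nested documents.
cat = es.cat.indices(index=index_name, format="json")
lucene_level = int(cat[0]["docs.count"])

print("top-level docs:", top_level, "lucene-level docs:", lucene_level)
```

If those two numbers diverge significantly, that would at least tell us whether nested/inner fields are what is pushing the index toward the limit.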
(We also asked this question on the Hail forum: Updating index gives number of documents in the index cannot exceed 2147483519 - Hail Query & hailctl - Hail Discussion.)