Figure out shard count from data size before creating an Elasticsearch index in Python


We need to know the size of each shard before creating an index in ES. We are using the Hail Python library, and have a MatrixTable that we convert to a Table and then write to Elasticsearch:

mt = mt.rows()
# Converts nested structs into one field, e.g. {a: {b: 1}} => a.b: 1
table = mt.drop('vep').flatten()
# Flattening unkeys the table, which causes problems because locus and alleles
# should not be regular fields, so we drop them.
table = table.drop(table.locus, table.alleles)

hl.export_elasticsearch(table, ...)

Is there a way to figure this out? We are using AWS EMR, so I suppose we can issue a query to get the parameters of the ES cluster or, in the worst case, just supply them to our Python Hail script. But I am still not sure how to compute it correctly, or whether it is possible at all.

Can I ask why?
There's no tool that I know of that can do this, other than actually indexing the data into Elasticsearch.

Sorry for the late answer. We need it to optimize shard size: each shard should stay below 50 GB. We have to set the number of shards when the index is created, right? So how can we know it beforehand? Otherwise the shard size could be anything: an input file could be 100 GB while the final index is 1.6 TB, or the input could be 100 GB and the index 900 GB. Since the ratio varies, we can't simply derive the number of shards from the size of the input file. We run a pipeline that adds a lot to the initial file before it is written to ES.

Your best bet is to test with part of the data and extrapolate.

You can use the split index API if you want to increase the shard count later, too.
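One caveat with splitting: the target index must have more primary shards than the source, and that number must be a multiple of the source's primary shard count (so 5 → 10 or 5 → 15 works, 5 → 12 does not). A tiny helper to sanity-check a planned split before issuing the `_split` call (the helper is just an illustration, not part of the Elasticsearch client):

```python
def valid_split(source_shards: int, target_shards: int) -> bool:
    """The _split API requires the target primary shard count to be a
    strict multiple of the source index's primary shard count."""
    return target_shards > source_shards and target_shards % source_shards == 0

print(valid_split(5, 10))  # True
print(valid_split(5, 12))  # False
```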
