Figure out shard count from data size before creating an Elasticsearch index in Python


We need to know the size of each shard before creating an index in ES. We are using the Hail Python library, and have a MatrixTable that we convert to a Table and then write to Elasticsearch:

mt = mt.rows()
# Converts nested structs into one field, e.g. {a: {b: 1}} => a.b: 1
table = mt.drop('vep').flatten()
# Flattening unkeys the table, which causes problems because locus and alleles
# should not be regular fields, so we drop them.
table = table.drop(table.locus, table.alleles)

hl.export_elasticsearch(table, ...)

Is there a way to figure this out? We are using AWS EMR, so I suppose we can issue a query to get the parameters of the ES cluster or, in the worst case, just supply them to our Python Hail script. But I am still not sure how to compute it correctly, or whether it is possible at all.

Can I ask why?
There's no tool that I know of that can do this, other than actually indexing the data into Elasticsearch.

Sorry for the late answer. We need it to optimize shard size: each shard should stay below 50 GB. We have to set the number of shards when the index is created, right? So how can we know it beforehand? Otherwise the shard size could be anything: an input file could be 100 GB while the final index is 1.6 TB, or the input could be 100 GB and the index 900 GB. Since the ratio varies, we can't simply derive the number of shards from the size of the input file. We run a pipeline that adds a lot to the initial file before it is written to ES.

Your best bet is to test with part of the data and extrapolate.

You can use the split index API if you want to increase the shard count later, too.
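One caveat with splitting: the target index must have more primary shards than the source, and that number must be a multiple of the source's primary shard count (so 5 → 10 or 5 → 15 works, 5 → 12 does not). A tiny helper to sanity-check a planned split before issuing the `_split` call (the helper is just an illustration, not part of the Elasticsearch client):

```python
def valid_split(source_shards: int, target_shards: int) -> bool:
    """The _split API requires the target primary shard count to be a
    strict multiple of the source index's primary shard count."""
    return target_shards > source_shards and target_shards % source_shards == 0

print(valid_split(5, 10))  # True
print(valid_split(5, 12))  # False
```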
