Figure out shard count from data size before creating an Elasticsearch index in Python


We need to know the size of each shard before creating an index in ES. We are using the Hail Python library, and have a MatrixTable that we convert to a Table and then write to Elasticsearch:

mt = mt.rows()
# Converts nested structs into one field, e.g. {a: {b: 1}} => a.b: 1
table = mt.drop('vep').flatten()
# Flattening unkeys the table, which causes problems because locus and alleles
# should not be regular fields, so we drop them.
table = table.drop(table.locus, table.alleles)

hl.export_elasticsearch(table, ...)

Is there a way to figure this out? We are using AWS EMR, so I suppose we can issue a query to get the parameters of the ES cluster or, in the worst case, just supply them to our Python Hail script. But I am still not sure how to compute it correctly, or whether it is possible at all.

Can I ask why?
There's no tool that I know of that can do this, other than actually indexing the data into Elasticsearch.

Sorry for the late answer. We need it to optimize shard size: each shard should stay below 50 GB. We have to set the number of shards when the index is created, right? So how can we know it beforehand? Otherwise the shard size could be anything: an input file could be 100 GB while the final index is 1.6 TB, or the input could be 100 GB and the index 900 GB. Since the ratio varies, we can't simply derive the number of shards from the size of the input file. We run a pipeline that adds a lot to the initial file before it is written to ES.

Your best bet is to test with part of the data and extrapolate.

You can use the split index API if you want to increase the shard count later, too.
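One caveat with splitting: the target index must have more primary shards than the source, and that number must be a multiple of the source's primary shard count (so 5 → 10 or 5 → 15 works, 5 → 12 does not). A tiny helper to sanity-check a planned split before issuing the `_split` call (the helper is just an illustration, not part of the Elasticsearch client):

```python
def valid_split(source_shards: int, target_shards: int) -> bool:
    """The _split API requires the target primary shard count to be a
    strict multiple of the source index's primary shard count."""
    return target_shards > source_shards and target_shards % source_shards == 0

print(valid_split(5, 10))  # True
print(valid_split(5, 12))  # False
```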
