This blog post, while old, is still about right. TL;DR: disk usage will often be within a factor of 2 of the input data size (in either direction), depending on the data and the mappings.
Don't store large binary data like images in ES; it's a waste of your valuable in-cluster resources. Put them somewhere cheaper, with a link stored in ES that points to the binary data. You can index them (e.g. as vector embeddings) if you want to use vector search, just don't store the binaries themselves there. See store | Elasticsearch Guide [8.11] | Elastic for a little more info on the difference.
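For illustration, here is a minimal sketch of a mapping that follows this advice, using the Python Elasticsearch client. The index name, field names and embedding dimension are just assumptions: the image bytes live in S3 (or any cheaper object store), and ES holds only a link plus an indexed embedding.

```python
from elasticsearch import Elasticsearch

# Assumed connection details; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

# Hypothetical index: ES stores a pointer to the image plus an indexed
# embedding, never the binary itself.
es.indices.create(
    index="images",
    mappings={
        "properties": {
            # Link to the binary stored in S3.
            "image_url": {"type": "keyword"},
            # Embedding produced outside ES (e.g. by an image model),
            # indexed for kNN / vector search.
            "image_embedding": {
                "type": "dense_vector",
                "dims": 512,            # depends on your model
                "index": True,
                "similarity": "cosine",
            },
            "caption": {"type": "text"},
        }
    },
)
```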
Thanks David. Yes, that sounds like a good idea. We can store the images in S3 or some other storage and do just the indexing and classifying elsewhere. Just curious - the accompanying Medium article (for some reason I am unable to paste the link here, the post gets flagged) seemed very promising. That was obviously a basic idea; to make a proper image search engine, we would have to employ a combination of sophisticated NLP, vector feature extraction, self-supervised classification, random forests, etc. Would all of this be possible by running just the classification, feature extraction, etc. on ES, while the actual images are stored on a different server, like S3, for serving to the end user?
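To make the basic split concrete, here is a rough sketch of the query side, continuing the hypothetical index above (field names, dimensions and index name are assumptions, not anything from the article): the embeddings and classifications are computed in your own pipeline, ES serves the kNN search over the indexed vectors, and the application resolves the returned S3 links when rendering results.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical query vector, produced by the same image model used at
# index time (a real vector would come from your feature-extraction step).
query_vector = [0.1] * 512

resp = es.search(
    index="images",
    knn={
        "field": "image_embedding",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    source=["image_url", "caption"],
)

# The hits carry only the S3 links; the application fetches the actual
# images from S3 when serving them to the end user.
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["image_url"])
```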
Thanks. I won't. I will stick to 50 GB shard sizes as you and others have explained in this post; I was using this just as an example. My concern is that a server with SSDs and high RAM (~64 GB) is very expensive, so I need a way to get an idea of how much space I might end up needing to index 5 PB of data (plus replicas; assuming only one replica for now), so that I can estimate the potential cost of such an endeavour.
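As a back-of-the-envelope check based on the factor-of-2 rule of thumb mentioned earlier in the thread (the ratios are only illustrative, not a sizing guarantee), a quick sketch of the range looks like this:

```python
# Rough capacity estimate, assuming the "within a factor of 2 either way"
# rule of thumb from earlier in the thread.
raw_data_pb = 5          # input data size in PB
replicas = 1             # one replica copy in addition to the primary

for ratio in (0.5, 1.0, 2.0):   # index size as a multiple of raw input
    primary_pb = raw_data_pb * ratio
    total_pb = primary_pb * (1 + replicas)
    print(f"ratio {ratio}: primaries ~{primary_pb} PB, total with replicas ~{total_pb} PB")
```

Under those assumptions the total footprint lands somewhere between roughly 5 PB and 20 PB, which is why running a small representative sample through your real mappings is the only way to narrow the estimate.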
Not sure; this is outside my area of expertise. I would hope so, but you might do better to open a separate thread on this question, because the experts in this area have probably stopped reading this thread by now.