Using ES for 1 PB of unstructured data

I'd like to use Elasticsearch to perform a full-text index of a large set of unstructured data, in the neighborhood of 1 PB. I estimate the actually index to be about 40% of this size, around 400 TB. There would be billions+ of items each containing text and metadata. I'm looking to gather some ideas about setting up a cluster that will have about 5 users searching across the dataset, and an acceptable query time of 30,000 - 60,000 ms. This index would be completely static in terms of growth, after all the data has been processed, nothing more would be added to it.

From everything I've learned through reading on Elastic and the Core Operations class, this seems like a complex challenge as it would take several thousand shards to keep them under 70 GB in size each (not including replicas), and A LOT of systems running multiple nodes on each. Is this realistic and possible? Can this be achieved with fewer resources by having fewer, larger shards? Would it make sense in a scenario like this to create a bunch of smaller indices or even multiple clusters - I'm not sure the answer to this because in the end, it would be the same 400 TB that needs to be searched.

Any additional thoughts or information that anyone can provide would be fantastic! I am relatively new to Elasticsearch and I want to make sure I'm not missing any major and understanding the basic concepts correctly.

I'm sure others have further input here. As a general getting started you might look here:

Your use case sounds really interesting. Care to share some more background?

Hope this helps for perspective,

Thanks @mainec! I'll watch these as soon as I can.

Essentially the source data exists on file shares (SAN and NAS back-ends) and I'm looking to use a binary-level extraction tool that can read the source data and make a "copy" of the text/metadata directly in ES. The tool is only part of the scenario as it simply uses its engine to transform the data into ES friendly JSON records, but it doesn't do anything with the architecture/configuration of ES. The idea is to have the data available in ES so that I can search across the newly indexed data set at a later time.

What you want to do is certainly possible but it won't be easy or cheap. You do have a couple advantages though with a static data set and relatively generous query response requirements but you're still talking about a large cluster no matter how you slice it. Even if you reduce it to 400TB you still have to consider replicas which could push you back into the PB range depending on your requirements.

So many considerations here that really depend on exactly what the data looks like and what capabilities the cluster needs to be able to provide. For instance if you need to run aggregations on that much data you'll have very different considerations (and expense) vs. just basic searches.

So you need to start with the data and think about what it looks like on a per doc basis, how it needs to be indexed and how it needs to be searched. From there you can start to project sizing numbers and do index design. In particular consider what logical partitions exist in the data so that you can segment it into smaller indices. If you have the luxury of it being time series data definitely leverage that, otherwise consider how else it can be partitioned. A single 400TB index will be a nightmare to manage and you'll be a very unhappy camper if you were to say lose a shard out of that. Smaller indices give you much more flexibility.

Whether you can do this with a single cluster or not will also depend on exactly what the sizing numbers come out to. You can't run 100TB on a single node so you have to find the right node size for your use case and then project the number of nodes required. If it starts to look like you're going to need hundreds of nodes (and if you use SSDs you probably will) then you may need to consider multiple clusters. Also be sure that any sizing calculation factors in overhead for cluster operations. If you need 1PB of space just for data then you need to factor in overhead on a per node basis to allow for things to stay functional when there are issues in the cluster. The details of that will depend on exactly how many nodes you have and the ultimate size of shards. This is hard to get right.

Once you have an estimate on what you think a node will look like then you should test a single node configuration, load it with a representative amount of data and then figure out if it can handle your query response requirements. ES queries scale surprisingly well as you add nodes but if your nodes are the wrong size everything can fall apart. As you do this you'll also encounter some of the limits that ES has when dealing with truly large amounts of data. Really explore how much data you can actually put on a node, this is heavily affected by the number of shards on the node. ES works best at lower density but with your query requirements you may be able to run higher density if you can size things so it doesn't run out of heap just starting up. That means fewer, larger shards. The fact your dataset is static could really help you here.

Anyway, it's impossible for me to give you any real answers so hopefully this rambling is of some help.


1 Like