I understand from what I have read that ES works best with a sweet spot of 64 GB RAM per node and a fair bit of SSD (3-4 TB per node, with multiple shards per node to handle primaries and replicas). I have a scenario where we have around 5 PB of total data that needs to be searchable by simple queries (akin to a plain search engine for now, no special needs like aggregations, etc.). So, assuming one backup (replica) for each shard, the total data storage I need would be around 10 PB. Assuming each node to be 4 TB in size (2 shards each, each shard occupying around 2 TB of storage), we would need around 2,500 such nodes. I have two questions:
For the storage, if we use HDD instead of SSD, what kind of performance hit are we likely to observe? Average search run times an order of magnitude slower? Or most queries on par with SSDs, with the odd query being slower, like we see here? Or in what other ways can using HDDs in place of SSDs hurt the performance of the overall stack?
Same question as above, but if we are working with 32 GB RAM per node instead of 64 GB?
Is there any sense of newer / recent data versus older / less-searched data?
What is the SLA for the search?
The 64 GB RAM, 8-16 CPU, 2-4 TB SSD configuration is generally what we term Hot, with very fast search access, say 10s of ms.
You may not need that... Or may not need that for all your data.
You can build hot nodes with 10 TB of data; it depends on many factors: how you organize your data, the searches, the query SLA.
Are you familiar with data tiering...
On the other end of the spectrum there is the concept of Searchable Snapshots and a Frozen data tier (terrible name), where a single frozen data node can access 100 TB of data stored on S3-compatible storage. Much more data, but searches will be slower... simple searches work pretty well, so perhaps that is an option.
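For illustration, mounting an index from a snapshot as a frozen (partially mounted) index is roughly a one-liner. This is a minimal sketch using the 8.x Python client; the repository, snapshot and index names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Mount an index from an S3-backed snapshot repository onto the frozen tier.
# storage="shared_cache" is what makes it a partially mounted (frozen) index;
# "my-s3-repo", "snapshot-2023-01" and "webcrawl-archive-000001" are placeholder names.
es.searchable_snapshots.mount(
    repository="my-s3-repo",
    snapshot="snapshot-2023-01",
    index="webcrawl-archive-000001",
    storage="shared_cache",
)
```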
This is a commercial paid feature; I am not trying to sell you anything, just noting that with very large data sets we often see lower TCO using commercial features... it lowers the HW footprint dramatically.
5 PB is non-trivial; if you are not an expert you might want to contact a Solutions Architect via Sales... I suspect they will be happy to chat with you.
If Basic / Free is best for you, they will tell you that.
I am sure there will be some good suggestions here, but a 5 PB dataset is a serious undertaking.
We don't have concepts of hot and cold or newer/older right now; it is all normal data. As for the 10 ms performance, let us say we want this 5 PB to be searchable through normal querying with performance similar to Google.
For that, how much impact will SSD vs. HDD and 64 GB vs. 32 GB RAM have? Will the impact be pronounced often? Or only in spikes? Or only when the search query rate is too high for HDD/32 GB RAM nodes to service, while they work at 10 ms - 1 sec latency most of the time? Google or Bing often takes up to 1 second to show search results, so let us assume that our initial target is that: 1 second. To achieve that against 5 PB of data, is 64 GB of RAM needed along with SSD? Or will 32 GB of RAM work well? What about HDD vs. SSD?
What type of data is this? How the raw data size translates into size on disk will depend a lot on the type of data and the mappings used. If this is machine-generated data, a 1:1 ratio may be a good initial assumption, but if these are binary documents, where you typically extract and index the text and store the binary source document somewhere else, it can be very different.
If you basically search all data for every query and want quite dense nodes in terms of indexed data volume per node, I would expect HDDs to perform a lot worse than SSDs with respect to both single-query latency and query throughput. Elasticsearch is I/O intensive, especially when a node holds more data than can fit in memory, and is often limited by storage performance. This is why using SSDs is one of the first recommendations for this type of search use case.
With respect to RAM, more always helps. As long as you have enough heap to handle the data and support queries, a 16 GB heap and 32 GB RAM may be sufficient to support the node density you desire. I would expect this use case to be limited by disk I/O and IOPS.
The best way to find out is to run a realistic benchmark with real data and queries.
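If it helps, here is a very rough sketch of timing a sample of representative queries with the Python client (the Rally benchmarking tool would be the more thorough option for a full benchmark); the index name and queries are placeholders:

```python
import time
from statistics import median, quantiles

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Placeholder queries: use a sample of the real queries you expect in production.
sample_queries = [
    {"match": {"content": "climate change"}},
    {"match": {"content": "machine learning"}},
]

latencies = []
for query in sample_queries * 50:  # repeat to get a rough latency distribution
    start = time.perf_counter()
    es.search(index="webcrawl-sample", query=query, size=10)
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median: {median(latencies):.1f} ms")
print(f"p95:    {quantiles(latencies, n=20)[18]:.1f} ms")
```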
Thanks. This is just plain CommonCrawl web data - normal web article content will be indexed in ES. We don't see any need to separately store the raw web content anywhere; it will be discarded after it is ingested into ES and indexed internally.
"As long as you have enough heap to handle data and support queries" - what does this mean? The data per node is 3-4 TB, divided into 2 shards per node; would 32 GB RAM be able to yield decent performance on top of that?
The minimum query latency will depend on the mappings and the query but will also grow with increased shard size. An ideal shard size may be between 10 GB and 50 GB (it depends on data, mappings and use case), so with 3 TB per node each node would hold between 60 and 300 shards. The amount of heap used grows with the amount of data stored on the node and will vary depending on the mappings used, document size and whether indices are constantly indexed into or updated, which may prohibit some optimisations.
As long as the heap is large enough to handle the overhead of the data stored on the node and any indexing/query load without suffering from long or frequent GC, I would expect RAM not to be the limiting factor with respect to performance. As every search will likely need to access every shard, there will be a lot of smallish, spread-out read requests that will require a lot of IOPS to serve quickly. This is why SSDs are far superior to HDDs, which typically lack in this area.
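To put rough numbers on this, here is a back-of-the-envelope sketch in plain Python, using the figures discussed in this thread as assumptions rather than measurements:

```python
# Back-of-the-envelope shard and node counts based on the numbers in this thread.
# All figures are assumptions, not measurements.
TB = 1024  # GB per TB

total_data_gb = 5 * 1024 * TB   # ~5 PB of indexed data, assuming roughly a 1:1 raw-to-index ratio
replicas = 1                    # one replica per primary shard
target_shard_gb = 50            # upper end of the 10-50 GB guideline
usable_per_node_gb = 3 * TB     # assume ~3 TB usable out of a 4 TB node

primary_shards = total_data_gb / target_shard_gb
total_shards = primary_shards * (1 + replicas)
nodes = total_data_gb * (1 + replicas) / usable_per_node_gb
shards_per_node = total_shards / nodes

print(f"primary shards:  {primary_shards:,.0f}")   # ~105,000
print(f"total shards:    {total_shards:,.0f}")     # ~210,000
print(f"nodes (approx):  {nodes:,.0f}")            # ~3,400
print(f"shards per node: {shards_per_node:,.0f}")  # ~60
```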
I think that the best way to answer a couple of your questions is to do a proof of concept with real data and usage, as Christian already mentioned.
Keep in mind that you won't be able to use the entire size of the data disk; Elasticsearch has a couple of disk watermarks that will start triggering as the disk gets full. By default, for a 4 TB disk it will allow you to use about 3.4 TB. This can be tuned, but I would not count on being able to use more than 3.8 TB with the default shard size (somewhere around 50 GB).
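For reference, this is a minimal sketch of how you could inspect or loosen those watermarks through the cluster settings API (8.x Python client; the percentages in the update are just an example, not a recommendation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Print the effective disk-watermark settings (defaults are 85% low, 90% high,
# 95% flood stage; the low watermark is what caps a 4 TB disk at roughly 3.4 TB).
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
for key, value in settings["defaults"].items():
    if "watermark" in key:
        print(key, value)

# The watermarks can be loosened if you accept less headroom.
es.cluster.put_settings(
    persistent={
        "cluster.routing.allocation.disk.watermark.low": "90%",
        "cluster.routing.allocation.disk.watermark.high": "95%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    }
)
```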
Also, if you are planning to use a cloud service like AWS, Azure or GCP, a 4 TB SSD disk will cost you more than the 64 GB VM itself; I have a couple of hot nodes with this configuration and the disk costs more than the VM.
Not sure what you are planning to spend on this, but with 2000 such nodes it would be something close to tens of millions per month.
I would follow Stephen's suggestion and contact Elastic to talk with a Solutions Architect; 10 PB of data, considering replicas, is really a lot of data, and having all of this without data tiering can be really, really expensive.
Thanks. I understand the expensive part, but how can an ES Solutions Architect help me? Sure, data tiering may make it more efficient and reduce some costs here and there, but the tens of millions USD will not come down to 100k USD, right?
If there are 60-300 shards per node, and given that shards exist both to index data and to store it redundantly as backups (which would not work if a primary shard and its replica were on the same node), are we not looking at a huge number of nodes to ensure an even distribution of all shards and their replicas, far beyond the "1 primary and 1 replica per node" model I was looking at?
They can understand your scenario and propose what would be better, taking cost and performance into consideration.
For example, maybe HDD will work for you, or maybe you can achieve a good response time for your use case using the frozen data tier with data stored in S3.
As already said, you need to do a Proof of Concept to find the best approach.
It was just a suggestion, not a requirement... if you get all the answers here / elsewhere and you have no need for data tiers or commercial features, no need.
As was mentioned, you will need to do some pretty extensive POCing.
I am not sure what you are referring to. A shard is the way Elasticsearch organises data. Each shard is basically a data store (Lucene instance) and you can index data into it and it then stores it on disk and also serves queries against this data. For high availability, but also performance, you typically have a primary shard with at least one replica shard. If you lose the primary the replica can be promoted to primary and continue to handle indexing and querying of that shard.
When you issue a query to Elasticsearch it identifies the shards that need to be queried and fans out queries to either the primary or a replica of every shard involved in the query. Each search against a specific shard is single-threaded (which is why larger shards result in slower queries), but queries against different shards can naturally be processed in parallel on each node.
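As a concrete illustration, here is a minimal sketch that creates an index with an explicit primary/replica layout and runs a query against it (8.x Python client; the index name, shard count and mappings are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# 100 primary shards, each with one replica: 200 shards in total. Elasticsearch
# spreads them across the cluster and never places a primary and its replica
# on the same node.
es.indices.create(
    index="webcrawl-000001",
    settings={"number_of_shards": 100, "number_of_replicas": 1},
    mappings={
        "properties": {
            "url": {"type": "keyword"},
            "content": {"type": "text"},
        }
    },
)

# A query is fanned out to one copy (primary or replica) of each of the
# 100 shards and the per-shard results are merged into a single response.
es.search(index="webcrawl-000001", query={"match": {"content": "example"}}, size=10)
```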
I do not understand what you are referring to here. Could you please elaborate?
Thanks. My initial plan was that we would store around 4 TB of data per node (thus 2500 nodes in total), with each node having two shards of 2 TB each: one primary and one secondary (replica) shard, the primary for that secondary shard naturally being on some other node. That way, it could work. But now you are suggesting that each shard should not exceed, say, 50 GB in size, which means that on one such node there will be hundreds of such shards. Since primary and backup shards should not reside on the same node, those shards will be a mix of primary shards and replica shards of primaries on other nodes. This same model is to be replicated on all nodes for uniformity, right? Doesn't that mean that the total number of replicas will be far more than the 1 replica I was planning, requiring even more than the 10 PB of data I had roughly calculated?
The shards are held in the cluster as a whole. Each node will hold a mix of primary and replica shards, but the overall shard size (assuming they are still reasonably large as in the range I quoted) will not affect the number of nodes required. You are still storing 2 copies of all data, just organised into smaller shards.
The cheapest possible way to store 5PB of searchable data with Elasticsearch would be to use S3 (IA tier) and access it via searchable snapshots, for which you're talking ~$50k per month in storage fees (plus some amount of access fees which depend very much on your usage pattern). You could kinda theoretically search all that with a tiny cluster if you don't mind waiting, but clearly something which performs better is going to cost more. How much more is an open question, and possibly not one we're going to be able to answer precisely on this (free!) community forum.
We do have some users with datasets of this size but they tend to get there with a lot of very careful tuning.
Hi folks, one last question: how much additional space does the index occupy, on average? Say I have 4 TB machines, consisting of 2 shards each, and into each of these shards I pump 2 TB of raw text for ES to index. Obviously that index will also take some space, maybe divided between memory and hard disk. Is there a ballpark figure for how much additional space the index can occupy on hard disk (in addition to being stored in 64 GB of RAM) when 2 TB of data is indexed in ES? In the same vein, suppose we are indexing images through vectors in ES, and the image sizes are anything between a few kB and a few MB. Will the total space required be the sum of the space needed to store both the images and the indexes separately? Or will ES create its index so that the images need not be stored separately, only their URLs, with the index being enough to point to the URL of the image that best matches the query?
Please do not create shards this big. Try to stick to the official recommendations, and if you are going to diverge from them, make sure you run a benchmark to see that it works for you. Shards are the unit that is moved around when nodes fail and data needs to be relocated, and querying huge shards like this is likely to be very slow. Very large shards can therefore have a negative impact on cluster stability as well as performance.
The ratio between the raw data size and the size it takes up on disk depends a lot on the use case and the mappings used. If it is a search use case, data is often analyzed and indexed in multiple ways, which can take up more space on disk compared to other use cases, e.g. logging. The best way to find out is to index a significant volume of realistic data with the mappings to be used and measure it that way. As mappings are flexible, the ratio can vary quite a lot, so it is hard to give an accurate estimate.
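A minimal sketch of how you could measure that ratio once you have indexed a representative sample (8.x Python client; the index name and raw size are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Compare the on-disk size of the primary shards with the raw size of the
# sample that was ingested ("webcrawl-sample" is a placeholder index name).
stats = es.indices.stats(index="webcrawl-sample")
primary_bytes = stats["_all"]["primaries"]["store"]["size_in_bytes"]
doc_count = stats["_all"]["primaries"]["docs"]["count"]

raw_bytes_ingested = 10 * 1024**3  # placeholder: raw size of the sample you indexed

print(f"docs indexed:       {doc_count:,}")
print(f"size on disk:       {primary_bytes / 1024**3:.1f} GiB (primaries only)")
print(f"index-to-raw ratio: {primary_bytes / raw_bytes_ingested:.2f}")
```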