What is the best way to distribute nodes and shards in Elasticsearch to achieve fast search while storing recent data on SSD and older data on HDD?

Hello,

I am storing events from different sources; the data arrives continuously and grows to very large volumes. I want to implement an effective data retention strategy.
I want to keep the last 14 days of data on SSD storage so that search performance is very fast.

I have an SSD server with a capacity of 75 TB, and I also have HDD storage with the same capacity, because the data is huge and is not deleted until after a considerable period.

To make the search faster, I came up with the following strategy:

SSD storage
Divide the SSD server into five nodes, each with the following specifications:

  • 64 GB RAM per node
  • 28 CPU cores per node
  • 15 TB SSD per node

This will be used to store the last 14 days of data for fast searches, and these nodes will serve as “hot” nodes.

HDD storage
I am thinking of dividing it as follows:
Five nodes, each with the following specifications:

  • 64 GB RAM
  • 32 CPU cores
  • 15 TB HDD

This will be used to store older data (more than 14 days) and these nodes will serve as “warm” nodes.

Master nodes
I will also create 3 dedicated master nodes with the following specifications:

  • 32 GB RAM
  • 16 CPU cores
  • 1 TB SSD

Note: In this strategy, each node will have a separate disk for the operating system, apart from the data disk.
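
To handle the move from the hot SSD nodes to the warm HDD nodes after 14 days, I am thinking of an index lifecycle (ILM) policy roughly like the sketch below. This is only my rough idea, not a tested setup: the policy name is a placeholder, the delete phase assumes about one month of total retention, and it relies on the hot and warm nodes being started with the data_hot and data_warm roles in elasticsearch.yml.

```python
import requests

ES = "http://localhost:9200"  # placeholder address

# Sketch of an ILM policy for the 14-day move: indices start on the hot
# (SSD) nodes, migrate to the warm (HDD) nodes 14 days after creation,
# and are deleted after roughly one month. Assumes hot nodes run with
# node.roles: [data_hot] and warm nodes with node.roles: [data_warm];
# the policy name and the delete phase are my own assumptions.
policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"set_priority": {"priority": 100}}},
            "warm": {
                "min_age": "14d",
                # the default "migrate" action moves shards to data_warm nodes
                "actions": {"set_priority": {"priority": 50}},
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(f"{ES}/_ilm/policy/events-retention", json=policy)
resp.raise_for_status()
```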

Shard allocation
Now for shard and replica allocation:
I will create 3 shards for each index, each shard being 40 GB in size, and each shard will have 1 replica.

Per day, the data can reach up to 2 TB, so I create daily indices. The size per index will be about 120 GB, since we have 3 shards of 40 GB each.
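
To make the shard settings concrete, the index template I have in mind would look roughly like this. Again just a sketch: the template name and index pattern are placeholders, it attaches the lifecycle policy sketched above, and it pins new indices to the hot tier.

```python
import requests

ES = "http://localhost:9200"  # placeholder address

# Sketch of an index template for the daily indices: 3 primary shards,
# 1 replica each, created on the hot (SSD) tier and managed by the ILM
# policy above. Template name and index pattern are placeholders.
template = {
    "index_patterns": ["events-*"],
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "index.lifecycle.name": "events-retention",
            # start new indices on the hot (SSD) nodes
            "index.routing.allocation.include._tier_preference": "data_hot",
        }
    },
}

resp = requests.put(f"{ES}/_index_template/events-daily", json=template)
resp.raise_for_status()
```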

I would like some help to make sure that this strategy is good for achieving very fast search performance.

  • Is this node division correct, or are there any suggestions?
  • Is this shard division optimal for achieving very fast search?

I have a few additional questions:

  • Are you indexing immutable data or do you also perform deletes and/or updates?
  • How many indices are you actively indexing into?
  • How much data are you ingesting per day?
  • What type of searches are you running? Is it just aggregations or do you also run searches returning actual documents? If so, what is the typical result set size?
  • What is your definition of very fast search?
  • How many concurrent queries do you expect to need to serve? How many indices do these queries target on average?
  • When using a hot-warm architecture it is generally assumed that the most recent data is queried more frequently than warm data. Is that the case for your use case or do you always target the full time period?

Yes, the data is immutable, and if any change happens, it is only in some fields such as the “status” field. As for deletion operations, I do not perform any deletions.

The number of indices I actively index per day is 10, and this number may increase over time.

The amount of data I ingest per day is 1 TB.

The searches I perform involve aggregations and also return actual documents, with a typical size of 20,000.
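
For reference, a typical query of mine has roughly the following shape. This is only an illustration; the index pattern, timestamp field, and aggregation are placeholders rather than my real mapping.

```python
import requests

ES = "http://localhost:9200"  # placeholder address

# Sketch of the kind of query described: filter the last 14 days,
# aggregate, and also return the matching documents themselves.
query = {
    "size": 10000,  # default index.max_result_window caps this at 10,000;
                    # larger result sets need search_after paging or a higher limit
    "query": {"range": {"@timestamp": {"gte": "now-14d"}}},
    "aggs": {"events_per_source": {"terms": {"field": "source.keyword"}}},
    "sort": [{"@timestamp": "desc"}],
}

resp = requests.post(f"{ES}/events-*/_search", json=query)
resp.raise_for_status()
hits = resp.json()["hits"]["hits"]
```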

I didn’t quite understand what you mean by “very fast search,” but I want the search to return results as fast as possible — for example, 1 second or less, or slightly more — the important thing is that it’s as fast as possible.

I don’t know exactly how many concurrent queries there will be, so you can assume the worst-case scenario. As for the indices, there are 10 per day, but searches may target data from the last 14 days. If there are 10 indices per day, that means over 14 days there will be 140 indices.

No, not always — the full retention period is one month, but I usually search only the last 14 days of data, while the full month’s data is only queried in some cases.

FYI: When you update an existing document — even if it’s just a single field — Elasticsearch reindexes the entire document with the same _id and an incremented _version. The previous version is marked as deleted. Strictly speaking, that means the data is no longer immutable. Frequent updates can affect search performance, as caching is less effective for rapidly changing data.
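
To make that concrete, here is a small illustration (the index name, document id, and field are made up):

```python
import requests

ES = "http://localhost:9200"  # placeholder address

# A partial update still rewrites the whole document under the hood;
# the stored _version increases and the previous version is only marked
# as deleted until segments are merged away.
doc_url = f"{ES}/events-demo/_doc/1"
requests.put(doc_url, json={"status": "open", "message": "example event"})

# Partial update of a single field via the _update API
requests.post(f"{ES}/events-demo/_update/1", json={"doc": {"status": "closed"}})

print(requests.get(doc_url).json()["_version"])  # now 2
```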

He also asked “What is your definition of very fast search?” and got back “I didn’t quite understand what you mean by ‘very fast search’.”

LOL. He was replying to your own imprecise language, “make sure that this strategy is good for achieving very fast search”.

It’s natural to aim for the best performance your hardware and budget allow. Most people do. Some will spend extra time on it, but you’re already ahead by considering this early - that is commendable.

I have a couple of questions too, mostly on the hardware side:

Do you really mean one, uno, a single server with some connectivity to 75TB of SSD storage, and presumably a lot of CPU and memory too (how much, specifically)?

For the HDD storage, do you have a pool of servers with the spec you gave, and you're thinking of allocating 5 of them here?

Or are we talking more virtual here, the "servers" are going to be VMs, or some mix, or what?

If I understand correctly, you already have a solution, so you have a baseline. For your most important query or queries, how long do they take today, and on what hardware? Have you profiled them to see where the time is spent? Do they come anywhere close to the “1 second or less” target you mentioned?


If any field can be updated, the data is not immutable, so I think what you stated here is a contradiction. If data is updated, how would this be done? Can data of any age be updated?

If matching documents need to be retrieved, this often requires a lot more random disk seeks. This is an area where the use of HDDs can have a significant negative impact on performance, both in terms of latency and query throughput.

You did specify very fast search. I was just asking what this really means.

In my experience query performance tends to get worse the more complex the query is and the more documents are returned. Expecting both large result sets and fast queries may or may not be possible on the hardware you specified. You will need to perform some benchmarks yourself to determine what is and is not possible on your hardware and with your data, mappings and queries.

You need to size the cluster based on a certain expected latency and number of concurrent queries. No cluster can handle a worst-case scenario.

How large a portion of the queries targets only the most recent data?

If the total amount of data ingested per day is 1TB and we assume you have 1 replica, that is 2TB of data on disk per day if we assume the raw data takes up the same amount of space on disk. With a 30 day retention that is a total of 60TB. Your hot nodes have a total of 75TB of SSD storage, so why do you need to move data over to a warm tier at all?
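
Spelled out, assuming the on-disk size is roughly the same as the raw size:

```python
# Back-of-the-envelope storage estimate from the numbers above
daily_ingest_tb = 1        # raw data ingested per day
replicas = 1               # one replica copy of every primary shard
retention_days = 30        # one-month retention

on_disk_per_day_tb = daily_ingest_tb * (1 + replicas)      # 2 TB/day
total_on_disk_tb = on_disk_per_day_tb * retention_days     # 60 TB
print(f"{total_on_disk_tb} TB on disk vs 75 TB of SSD on the hot nodes")
```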

If you have this hardware available I would recommend running some benchmarks based on real data, mappings and queries to see what is possible. I would start by adding data just to the hot nodes and seeing how querying them works as more and more data is added. This should give you a feel for how querying the most recent data will perform. Once you have added some data, test querying starting at just 1 concurrent query and then work your way up until latencies are no longer acceptable.

Thanks for this observation, I didn’t notice it.

What if I don’t change anything in the data — would that make a difference?

Thank you for your compliment.

Yes, a single server with the following specs:

  • 1.5 TB RAM
  • 144 CPU cores
  • 75 TB SSD

I divide it into virtual environments with the specs I mentioned earlier.

As for the HDD, it’s QNAP storage, and I divide it into virtual environments as well, each with the specs I mentioned.

Yes, the “servers” will be virtual machines from the large physical server.

I have only tested on virtual machines so far, and I found that returning search results for 30,000 documents takes the same time as returning 10,000 documents — around 3 to 4 seconds — but I need the data to come back faster.

Sorry, is the whole environment, maybe 5+5+3 VMs, on that one big server?

Was this while only using the SSD storage or with a combination of both?

Are these local SSD disks or some form of networked storage based on SSDs?

As I mentioned earlier, I believe you will need to do some benchmarking with real data and queries. In order to explore what your hardware is capable of for your specific use case, I would probably do something like this:

  1. Set up 10 hot nodes on the large SSD-backed server. Each node would have almost 150GB RAM and 14 CPU cores. Set the heap to a reasonable value for the initial test. You want to keep it as small as possible, so 8GB or 16GB might be a good starting point.
  2. Set each index to have 1 replica shard and aim for a shard size of around 35GB or so. Then load data into the cluster so each node holds about 100GB of data (primary and replica shards). This should allow all data to eventually fit in the operating system page cache, which will allow querying that is not limited by disk I/O.
  3. Run real queries and return the result sets you expect to return. Verify that the data set you have indexed actually returns the expected amount of data. You want to compare the same result set size across all subsequent tests.
  4. If a single concurrent query with this setup gives acceptable latencies, gradually increase the number of concurrent queries until latencies are no longer acceptable.
  5. Then add the same amount of data to the cluster again. Each node will now hold around 200GB of data, which will not all fit in the page cache. Run the same set of query tests again. Rinse and repeat.

This should give you an idea of how your use case works on the hot nodes you have in place. If you fail on the first step you may need to test with smaller shard sizes.
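
For the concurrency part of the test, a very rough harness along the lines of the sketch below could work. The endpoint, index pattern, and query body are placeholders, and a dedicated benchmarking tool such as Rally is of course also an option.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ES = "http://localhost:9200"        # placeholder address
INDEX = "events-*"                  # placeholder index pattern
QUERY = {                           # placeholder: substitute one of your real queries
    "size": 10000,
    "query": {"range": {"@timestamp": {"gte": "now-14d"}}},
}

def one_query() -> float:
    """Run a single search and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(f"{ES}/{INDEX}/_search", json=QUERY)
    resp.raise_for_status()
    return time.perf_counter() - start

def run(concurrency: int, queries_per_client: int = 20) -> None:
    """Fire `concurrency` parallel clients and report latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(one_query)
                   for _ in range(concurrency * queries_per_client)]
        latencies = sorted(f.result() for f in futures)
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"concurrency={concurrency}  p50={p50:.3f}s  p95={p95:.3f}s")

# Start at a single concurrent query and work upwards (step 4 above)
for concurrency in (1, 2, 4, 8, 16, 32):
    run(concurrency)
```

Run it first against the page-cache-sized data set from step 2, then rerun the same loop after each additional data load in step 5.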


I have confirmed the matter: there is a service that has updates, but to solve this issue, I will store this service’s data in another database. Therefore, you can assume that the data will not undergo any updates or changes.

No, I will search the data on the SSD for the last 14 days, and I will only search the HDD in rare cases.

The expected response time is 1 second or less, and the concurrent queries range from a minimum of 30 to a maximum of 120 simultaneous connections.

The percentage of queries targeting the most recent data is 80%.

Your calculation in this regard is very accurate, but the data will not remain at 1 TB per day — it may increase to 2 TB after a month, and with replicas, the total will be 4 TB. Also, I am constrained by the server specifications and cannot create more virtual machines because the CPU has only 144 cores. If I create more nodes, I would have to reduce the number of CPUs per node, and the extra storage would be wasted. However, I am thinking about the future: data volumes will grow, and what matters most to me is search speed. I can even increase the RAM to 128 GB without any problem if it ensures faster performance.

If we assume each query returns 20000 results on average, a maximum of 120 concurrent queries returning in 1 second means 2.4 million documents being returned per second. As results are spread out across shards and indices, that will likely result in a large amount of random disk reads.

You will need to answer this question. I am not sure what you are asking for is achievable even with local SSDs.

No, of course not all of them are on one server.
There are 5 virtual machines on the large SSD server for the hot data layer.
Additionally, there are 3 virtual machines for the master nodes on an SSD server separate from the large server.
As for the last 5 machines, which use HDD storage, they are on a QNAP system separate from the previous two servers.

No, I used HDD because I didn’t have an SSD server available. I created virtual machines from the HDD server, setting up 2 nodes and loading each node with the same data of the same size.

However, on the first virtual machine, I created an index with one shard and one replica. On the second virtual machine, I created the index with 3 shards, each having one replica, and then I executed the query.

When I run the query on 10,000 documents, the results are similar. But when I run the same query on 20,000 or 30,000 documents on the virtual machine with 3 shards, it returns the data faster than the 10,000 documents on the virtual machine with the index that has only 1 shard.

I would be very surprised if an HDD backed server would be suitable for this use case once a majority of data is no longer cached in memory. If the test you ran only used a small data set that all fits in the page cache and it still did not meet the required latency, running the same test on the SSD backed server may not give better results.

I would recommend testing with different shard sizes and run the query test on a number of nodes with enough RAM so the full data set fits in memory. This is your best case scenario as no I/O will be required. If you can not make this fast enough even when the data is all cached you will not be able to make it work at larger scale on either server when disk I/O will be required to serve requests.

Local disks

I will do this. Thank you for guiding me through these steps.

But should I consider my plan correct, or are there things I need to add, such as additional nodes, or are the current nodes sufficient for the amount of data?

And what do I need to do to be able to query this data smoothly?

Local SSDs

First run a query test when all data fits in memory as that is your best case scenario (like I described). Once that gives you a shard size that works for a single concurrent query you can increase the number of concurrent queries until latencies grow too large.

It is quite possible that what you are aiming for is not at all possible given the hardware you have available. Once we know the results of the tests I described we will have a better idea whether it may be possible or not and how to best run further benchmarks.

SSD devices are considered fast; will they provide me with fast response times?

Okay, let's forget about the HDD for now, because I consider it just for archival purposes. What I’m talking about is the SSD, which should be fast for the last 14 days of data and also fast for queries.

Having the data cached in memory is faster than SSDs, so first make sure you can reach your objectives this way. This allows you to tune shard sizes and queries. If you are successful with this you can increase the data volume beyond what fits in RAM. Even though SSDs are fast this will likely gradually increase latencies.

You need to run benchmarks to be sure.

In my experience Elasticsearch is optimised for fast retrieval of reasonably small result sets, as that is how a search engine generally operates. I have personally often struggled to return large result sets within the latencies you describe, even when using SSDs. I suspect it is unlikely that you will be able to query up to 75TB of data within those latencies at any high query concurrency, even if the data is on SSDs. I cannot be sure, however, which is why I recommended benchmarking and testing.