Elasticsearch design feasibility

Hello All,

I would need your help validating a few numbers, based on your experience.
I would be happy to get a response or some feedback.

I wanted to check whether the numbers below are feasible for Elasticsearch to handle.

100k documents per second
8.5+ billion documents per day
Kafka topics will feed the data, with a maximum delay of 3 minutes between data generation and its consumption by Elasticsearch.
3 primary/1 replica shard setup

Queries: 2x the number of records inserted per second, i.e. 200k to 300k per second on average.
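
For context, the ingestion path would be roughly along the lines of the sketch below. This is only an illustration: the topic name, hosts, index name and batch size are placeholders, not the actual configuration.

```python
# Rough sketch: consume events from a Kafka topic and bulk-index them into
# Elasticsearch. All names/hosts below are placeholders.
import json

from kafka import KafkaConsumer                      # kafka-python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://es-node-1:9200"])
consumer = KafkaConsumer(
    "events-topic",                                  # placeholder topic
    bootstrap_servers=["kafka-1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def to_actions(batch):
    # One index per day per topic is the planned layout.
    for doc in batch:
        yield {"_index": "events-topic-2024.01.01", "_source": doc}

batch = []
for msg in consumer:
    batch.append(msg.value)
    if len(batch) >= 5000:                           # batch size to be tuned
        helpers.bulk(es, to_actions(batch))
        batch.clear()
```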

Based on the above figures, how much would the data size be per day? I assumed it would be around 2000 GB, but is it more or less?
I used the Elasticsearch sizing guides and arrived at the numbers below.

2000 GB * 7 days * ( 1 + 1 ) = 28,000 GB → 28,000 GB * 1.25 = 35,000 GB → 35,000 / (64 GB * 30) + 1 ≈ 19 hot nodes

2000 GB * 180 days * ( 1 + 1 ) = 720,000 GB → 720,000 GB * 1.25 = 900,000 GB → 900,000 / (64 GB * 160) + 1 ≈ 88 nodes

(with increased memory:data ratios of 50 and 250)

2000 GB * 7 days * ( 1 + 1 ) = 28,000 GB → 28,000 GB * 1.25 = 35,000 GB → 35,000 / (64 GB * 50) + 1 ≈ 11 hot nodes

2000 GB * 180 days * ( 1 + 1 ) = 720,000 GB → 720,000 GB * 1.25 = 900,000 GB → 900,000 / (64 GB * 250) + 1 ≈ 57 warm nodes
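
The same arithmetic as a small Python sketch, just to make the steps explicit (the last factor is the memory:data ratio; the function simply reproduces the calculation above):

```python
def nodes_needed(daily_gb, retention_days, node_ram_gb, memory_data_ratio,
                 replicas=1, overhead=1.25):
    # Total data = raw data per day * retention * (primaries + replicas),
    # then 25% overhead, divided by (RAM per node * memory:data ratio),
    # plus one extra node, truncated as in the figures above.
    total_data = daily_gb * retention_days * (replicas + 1)
    total_storage = total_data * overhead
    return int(total_storage / (node_ram_gb * memory_data_ratio) + 1)

print(nodes_needed(2000, 7,   64, 30))    # 19 hot nodes
print(nodes_needed(2000, 180, 64, 160))   # 88 nodes
print(nodes_needed(2000, 7,   64, 50))    # 11 hot nodes
print(nodes_needed(2000, 180, 64, 250))   # 57 warm nodes
```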

These numbers look too high, but if they are correct, I have decided on the hardware below.

CPU – minimum 32 cores per machine

RAM – 64 GB

11 hot nodes – 3 TB SSD per node

57 warm nodes – 16 TB SSD per node

5 master nodes (1.9 TB/2 TB storage is not required on these nodes)

1. Does the above Elasticsearch design make sense? Unfortunately, I am unable to set up a small instance and test it out, so I have to rely on estimates only.

2. Can we use compression for data older than 7 days? How do we do it? I do not want the current data to be compressed. According to online guides, this would result in 10-20% storage gains. From what I have read it might be done roughly as in the sketch after this list, but please correct me if that is wrong.

3. Suggestion: the 32 GB heap limit feels too restrictive; for such cases it needs to be a 64-bit Java process that can use a larger heap. Is that limit still the case, or is there a 64-bit version or a plan for multiple parallel ~32 GB processes in the pipeline?
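
From what I have read, applying best_compression to indices older than 7 days would look roughly like the sketch below (plain HTTP calls against the cluster; the index name is a placeholder, and please correct me if this is wrong):

```python
# Rough sketch: switch an older index to best_compression. index.codec is a
# static setting, so the index has to be closed first, and only newly written
# segments pick up the new codec, hence the force merge at the end.
import requests

ES = "http://localhost:9200"
index = "events-topic-2024.01.01"    # placeholder for an index older than 7 days

requests.post(f"{ES}/{index}/_close")
requests.put(f"{ES}/{index}/_settings",
             json={"index": {"codec": "best_compression"}})
requests.post(f"{ES}/{index}/_open")
# Rewrite the segments so the new codec applies to the existing data.
requests.post(f"{ES}/{index}/_forcemerge?max_num_segments=1")
```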

Let me know if anything is needed.

How did you arrive at this? Are you indexing into a single index or a set of indices? Is your data immutable? Are you planning on using rollover/data streams?

Can you please provide more details about the use case and the nature and purpose of these queries? How targeted, in terms of indices, are these queries?

This sounds a lot, lot higher than what is generally assumed for a hot-warm architecture, so I seriously doubt the standard way of sizing hot-warm architectures applies here.

Thanks for the response, Christian.

How did you arrive at this? Are you indexing into a single index or a set of indices? Is your data immutable? Are you planning on using rollover/data streams?
This is our standard shard setup for all deployments. We will be indexing into multiple indices, one created per day for each topic, let's say. Data is immutable. I am not aware of data streams.
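
For reference, our per-day, per-topic index layout is roughly captured by an index template like the sketch below (names and hosts are placeholders; it just applies the standard 3 primary / 1 replica setup to every daily index):

```python
# Rough sketch of the per-topic index template: every daily index matching the
# pattern gets the standard 3 primary / 1 replica layout. Names are placeholders.
import requests

ES = "http://localhost:9200"
template = {
    "index_patterns": ["events-topic-*"],        # e.g. events-topic-2024.01.01
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
        }
    },
}
requests.put(f"{ES}/_index_template/events-topic", json=template)
```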

Can you please provide more details about the use case and the nature and purpose of these queries? How targeted, in terms of indices, are these queries?
Data is immutable; we are not looking to update any documents once they are inserted. Assuming 100k insertions per second, we are looking to read those 100k documents at least once to search for failed cases based on a field, then use that for further analysis. Another 100k to 200k reads come from reporting that data in two reports/visualizations.

This sounds a lot, lot higher than what is generally assumed for a hot-warm architecture, so I seriously doubt the standard way of sizing hot-warm architectures applies here.
Yes, this looks too high to be feasible in Elasticsearch. I am also checking whether Apache Spark or Hive can be used for this.

Are you fetching documents by ID or running queries for specific fields? When you are searching, will you know which index the data resides in or will you need to search all indices?

You are talking about reading 100k documents, but I am not sure how this translates into 100k+ queries per second. Can you please elaborate?

We will be fetching by specific fields: time plus various other filters. It will search only specific indices, not all of them.

Let's assume 10 documents, each with its own time, customer details, detail1, detail2, ..., detailn.
Documents will be queried for a specific time and then filtered on detail1, or on detail2, and so on. A specific time would result in 100k documents returned; detail1 will result in, let's say, 20k records returned.
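
A typical query would look roughly like the sketch below: restrict to a specific time window, then filter on one of the detail fields. Field names and values follow the detail1/detail2 example above and are placeholders only.

```python
# Rough sketch of the query shape: a time-range filter plus a term filter on
# one of the detail fields, against a single daily index.
import requests

ES = "http://localhost:9200"
query = {
    "size": 10000,   # fetching the full 100k hits would need search_after/scroll
    "query": {
        "bool": {
            "filter": [
                {"range": {"time": {"gte": "2024-01-01T10:00:00",
                                    "lt":  "2024-01-01T10:03:00"}}},
                {"term": {"detail1": "failed"}},
            ]
        }
    },
}
# Only the specific daily index is searched, not all indices.
resp = requests.post(f"{ES}/events-topic-2024.01.01/_search", json=query)
print(resp.json()["hits"]["total"])
```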

How many of these queries, each returning thousands of documents, would run per second?

If you are able to target a select few indices, how many queries will target indices on the hot nodes vs older indices on the warm nodes?

The number of queries would be small, but each will return a huge number of documents. I think a maximum of 20 queries per second.

All queries are supposed to run on the hot nodes; only historic reporting would refer to the older warm nodes, and that may be very nominal. The highest priority is the hot nodes and the most recent, near-real-time data.

That is dramatically different from what you initially stated, and much more in line with the assumptions around a hot-warm architecture. Given the number of documents returned and the query concurrency, it is still a significant query load though. How well you can target your queries will be an important factor, as the fewer shards you need to query, the easier it will likely be to handle the number of concurrent queries you are specifying.

Given the size of the data set and the fact that you are retrieving documents from disk and not running aggregations, I suspect you will be limited by disk I/O due to the large number of random disk reads. I therefore suspect you may need a higher number of hot nodes than in your examples, but this is likely something you will need to benchmark, as it depends on document sizes and on how targeted and efficient the queries are.

The warm nodes can potentially be quite dense, but this will depend on the latencies required when you do query these nodes.

There is a shard limit of 2,147,483,519 documents, right? So we would need at least 3 shards to store one day of data.

Yes, I am planning for a high-performance SSD cluster, but I still doubt whether it will function within the expected performance.

Is the Elasticsearch heap still limited to 32 GB? I understand the concept that there is no memory gain if the heap is between 32 and 42 GB because of compressed OOPs. But can we increase the heap beyond 42 GB, will Elasticsearch use it, and will there be any gain if I go for a 128 GB machine?

Yes, that is correct. You typically want to limit the shard size to less than 50GB or so, which may mean that you need a larger number of primary shards. How many will depend on your data and mappings though.
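
For illustration, a rough back-of-the-envelope check using the figures mentioned earlier in the thread (roughly 2000 GB and 8.5 billion documents per day across all topic indices combined, the ~2.147 billion docs-per-shard limit, and a ~50 GB shard-size target):

```python
import math

DOCS_PER_DAY = 8_500_000_000
GB_PER_DAY = 2000                       # estimated primary data per day
MAX_DOCS_PER_SHARD = 2_147_483_519      # per-shard document limit
TARGET_SHARD_GB = 50                    # rough upper bound per shard

by_doc_count = math.ceil(DOCS_PER_DAY / MAX_DOCS_PER_SHARD)   # 4 shards
by_size = math.ceil(GB_PER_DAY / TARGET_SHARD_GB)             # 40 shards

# The shard-size target, not the document limit, dominates here: a full day's
# data would need on the order of 40 primary shards spread across the indices.
print(by_doc_count, by_size)
```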

SSDs are surely required, but whether it will or will not meet your latency and performance requirements is hard to tell without benchmarking or testing.

I believe it is still recommended to stay below 31-32 GB, even though G1GC is better at handling larger heaps than CMS GC. When deploying on large machines, a lot of users use containers and run multiple nodes per host.
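
As a rough sketch of that approach, assuming the usual guidance of keeping the heap at or below ~50% of a node's RAM and under ~31 GB so compressed OOPs stay enabled (set via -Xms31g / -Xmx31g in config/jvm.options):

```python
# Back-of-the-envelope packing of Elasticsearch nodes onto a 128 GB host.
HOST_RAM_GB = 128
HEAP_GB = 31                     # stay under the compressed-OOPs threshold
RAM_PER_NODE_GB = HEAP_GB * 2    # leave at least as much again for the OS page cache

nodes_per_host = HOST_RAM_GB // RAM_PER_NODE_GB
print(nodes_per_host)            # 2 nodes per 128 GB host under these assumptions
```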
