Elasticsearch design feasibility

Hello All,

I would need your help validating a few numbers, based on your experience.
I would be happy to get a response or some feedback.

I wanted to check whether the numbers below are feasible for Elasticsearch to handle.

100k documents per second
8.5+ billion documents per day
Kafka topics will feed the data, with a maximum delay of 3 minutes between data generation and its consumption by Elasticsearch.
3 primary/1 replica shard setup

Queries: 2x the number of records inserted per second, i.e. 200k to 300k per second on average.
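
For context, the ingestion path would be roughly along the lines of the sketch below. This is only an illustration: the topic name, hosts, index name and batch size are placeholders, not the actual configuration.

```python
# Rough sketch: consume events from a Kafka topic and bulk-index them into
# Elasticsearch. All names/hosts below are placeholders.
import json

from kafka import KafkaConsumer                      # kafka-python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://es-node-1:9200"])
consumer = KafkaConsumer(
    "events-topic",                                  # placeholder topic
    bootstrap_servers=["kafka-1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def to_actions(batch):
    # One index per day per topic is the planned layout.
    for doc in batch:
        yield {"_index": "events-topic-2024.01.01", "_source": doc}

batch = []
for msg in consumer:
    batch.append(msg.value)
    if len(batch) >= 5000:                           # batch size to be tuned
        helpers.bulk(es, to_actions(batch))
        batch.clear()
```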

Based on the above figures, how much would the data size be per day? I assumed it would be around 2000 GB, but is it more or less?
I used the Elasticsearch sizing guides and arrived at the numbers below.

2000 GB * 7 days * ( 1 + 1 ) = 28,000 GB → 28,000 GB * 1.25 = 35,000 GB → 35,000 / (64 GB * 30) + 1 ≈ 19 hot nodes

2000 GB * 180 days * ( 1 + 1 ) = 720,000 GB → 720,000 GB * 1.25 = 900,000 GB → 900,000 / (64 GB * 160) + 1 ≈ 88 nodes

(with increased memory:data ratios of 50 and 250)

2000 GB * 7 days * ( 1 + 1 ) = 28,000 GB → 28,000 GB * 1.25 = 35,000 GB → 35,000 / (64 GB * 50) + 1 ≈ 11 hot nodes

2000 GB * 180 days * ( 1 + 1 ) = 720,000 GB → 720,000 GB * 1.25 = 900,000 GB → 900,000 / (64 GB * 250) + 1 ≈ 57 warm nodes
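
The same arithmetic as a small Python sketch, just to make the steps explicit (the last factor is the memory:data ratio; the function simply reproduces the calculation above):

```python
def nodes_needed(daily_gb, retention_days, node_ram_gb, memory_data_ratio,
                 replicas=1, overhead=1.25):
    # Total data = raw data per day * retention * (primaries + replicas),
    # then 25% overhead, divided by (RAM per node * memory:data ratio),
    # plus one extra node, truncated as in the figures above.
    total_data = daily_gb * retention_days * (replicas + 1)
    total_storage = total_data * overhead
    return int(total_storage / (node_ram_gb * memory_data_ratio) + 1)

print(nodes_needed(2000, 7,   64, 30))    # 19 hot nodes
print(nodes_needed(2000, 180, 64, 160))   # 88 nodes
print(nodes_needed(2000, 7,   64, 50))    # 11 hot nodes
print(nodes_needed(2000, 180, 64, 250))   # 57 warm nodes
```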

These numbers look too high, but if they are correct, I have decided on the hardware below.

CPU – minimum 32 cores per machine

RAM – 64 GB

11 hot nodes – 3 TB SSD per node

57 warm nodes – 16 TB SSD per node

5 master nodes (1.9 TB/2 TB storage is not required on these nodes)

1. Does the above Elasticsearch design make sense? Unfortunately, I am unable to set up a small instance and test it out, so I have to rely on estimates only.

2. Can we use compression for data older than 7 days? How do we do it? I do not want the current data to be compressed. According to online guides, this would result in 10-20% storage gains. From what I have read it might be done roughly as in the sketch after this list, but please correct me if that is wrong.

3. Suggestion: the 32 GB heap limit feels too restrictive; for such cases it needs to be a 64-bit Java process that can use a larger heap. Is that limit still the case, or is there a 64-bit version or a plan for multiple parallel ~32 GB processes in the pipeline?
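
From what I have read, applying best_compression to indices older than 7 days would look roughly like the sketch below (plain HTTP calls against the cluster; the index name is a placeholder, and please correct me if this is wrong):

```python
# Rough sketch: switch an older index to best_compression. index.codec is a
# static setting, so the index has to be closed first, and only newly written
# segments pick up the new codec, hence the force merge at the end.
import requests

ES = "http://localhost:9200"
index = "events-topic-2024.01.01"    # placeholder for an index older than 7 days

requests.post(f"{ES}/{index}/_close")
requests.put(f"{ES}/{index}/_settings",
             json={"index": {"codec": "best_compression"}})
requests.post(f"{ES}/{index}/_open")
# Rewrite the segments so the new codec applies to the existing data.
requests.post(f"{ES}/{index}/_forcemerge?max_num_segments=1")
```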

Let me know if anything is needed.

How did you arrive at this? Are you indexing into a single index or a set of indices? Is your data immutable? Are you planning on using rollover/data streams?

Can you please provide more details about the use case and the nature and purpose of these queries? How targeted, in terms of indices, are these queries?

This sounds a lot, lot higher than what is generally assumed for a hot-warm architecture, so I seriously doubt the standard way of sizing hot-warm architectures applies here.

Thanks for the response, Christian.

How did you arrive at this? Are you indexing into a single index or a set of indices? Is your data immutable? Are you planning on using rollover/data streams?
This is our standard shard setup for all deployments. We will be indexing into multiple indices, one created per day for each topic, let's say. Data is immutable. I am not aware of data streams.
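
For reference, our per-day, per-topic index layout is roughly captured by an index template like the sketch below (names and hosts are placeholders; it just applies the standard 3 primary / 1 replica setup to every daily index):

```python
# Rough sketch of the per-topic index template: every daily index matching the
# pattern gets the standard 3 primary / 1 replica layout. Names are placeholders.
import requests

ES = "http://localhost:9200"
template = {
    "index_patterns": ["events-topic-*"],        # e.g. events-topic-2024.01.01
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
        }
    },
}
requests.put(f"{ES}/_index_template/events-topic", json=template)
```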

Can you please provide more details about the use case and the nature and purpose of these queries? How targeted, in terms of indices, are these queries?
Data is immutable; we are not looking to update any documents once they are inserted. Assuming 100k insertions per second, we are looking to read those 100k documents at least once to search for failed cases based on a field, then use that for further analysis. Another 100k to 200k reads come from reporting that data in two reports/visualizations.

This sounds a lot, lot higher than what is generally assumed for a hot-warm architecture, so I seriously doubt the standard way of sizing hot-warm architectures applies here.
Yes, this looks too high to be feasible in Elasticsearch. I am also checking whether Apache Spark or Hive can be used for this.

Are you fetching documents by ID or running queries for specific fields? When you are searching, will you know which index the data resides in or will you need to search all indices?

You are talking about reading 100k documents, but I am not sure how this translates into 100k+ queries per second. Can you please elaborate?

We will be fetching by specific fields: time plus various other filters. It will search only specific indices, not all of them.

Let's assume 10 documents, each with its own time, customer details, detail1, detail2, ..., detailn.
Documents will be queried for a specific time and then filtered on detail1, or on detail2, and so on. A specific time would result in 100k documents returned; detail1 will result in, let's say, 20k records returned.
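
A typical query would look roughly like the sketch below: restrict to a specific time window, then filter on one of the detail fields. Field names and values follow the detail1/detail2 example above and are placeholders only.

```python
# Rough sketch of the query shape: a time-range filter plus a term filter on
# one of the detail fields, against a single daily index.
import requests

ES = "http://localhost:9200"
query = {
    "size": 10000,   # fetching the full 100k hits would need search_after/scroll
    "query": {
        "bool": {
            "filter": [
                {"range": {"time": {"gte": "2024-01-01T10:00:00",
                                    "lt":  "2024-01-01T10:03:00"}}},
                {"term": {"detail1": "failed"}},
            ]
        }
    },
}
# Only the specific daily index is searched, not all indices.
resp = requests.post(f"{ES}/events-topic-2024.01.01/_search", json=query)
print(resp.json()["hits"]["total"])
```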

How many of these queries, each returning thousands of documents, would run per second?

If you are able to target a select few indices, how many queries will target indices on the hot nodes vs older indices on the warm nodes?

The number of queries would be small, but each will return a huge number of documents. I think a maximum of 20 queries per second.

All queries are supposed to run on the hot nodes; only historic reporting would refer to the older warm nodes, and that may be very nominal. The highest priority is the hot nodes and the most recent, near-real-time data.

That is dramatically different from what you initially stated, and much more in line with the assumptions around a hot-warm architecture. Given the number of documents returned and the query concurrency, it is still a significant query load though. How well you can target your queries will be an important factor, as the fewer shards you need to query, the easier it will likely be to handle the number of concurrent queries you are specifying.

Given the size of the data set and the fact that you are retrieving documents from disk and not running aggregations, I suspect you will be limited by disk I/O due to the large number of random disk reads. I therefore suspect you may need a higher number of hot nodes than in your examples, but this is likely something you will need to benchmark, as it depends on document sizes and on how targeted and efficient the queries are.

The warm nodes can potentially be quite dense, but this will depend on the latencies required when you do query these nodes.

There is a shard limit of 2,147,483,519 documents, right? So we would need at least 3 shards to store one day of data.

Yes, I am planning for a high-performance SSD cluster, but I still doubt whether it will function within the expected performance.

Is the Elasticsearch heap still limited to 32 GB? I understand the concept that there is no memory gain if the heap is between 32 and 42 GB because of compressed OOPs. But can we increase the heap beyond 42 GB, will Elasticsearch use it, and will there be any gain if I go for a 128 GB machine?

Yes, that is correct. You typically want to limit the shard size to less than 50GB or so, which may mean that you need a larger number of primary shards. How many will depend on your data and mappings though.
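
For illustration, a rough back-of-the-envelope check using the figures mentioned earlier in the thread (roughly 2000 GB and 8.5 billion documents per day across all topic indices combined, the ~2.147 billion docs-per-shard limit, and a ~50 GB shard-size target):

```python
import math

DOCS_PER_DAY = 8_500_000_000
GB_PER_DAY = 2000                       # estimated primary data per day
MAX_DOCS_PER_SHARD = 2_147_483_519      # per-shard document limit
TARGET_SHARD_GB = 50                    # rough upper bound per shard

by_doc_count = math.ceil(DOCS_PER_DAY / MAX_DOCS_PER_SHARD)   # 4 shards
by_size = math.ceil(GB_PER_DAY / TARGET_SHARD_GB)             # 40 shards

# The shard-size target, not the document limit, dominates here: a full day's
# data would need on the order of 40 primary shards spread across the indices.
print(by_doc_count, by_size)
```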

SSDs are surely required, but whether it will or will not meet your latency and performance requirements is hard to tell without benchmarking or testing.

I believe it is still recommended to stay below 31-32 GB, even though G1GC is better at handling larger heaps than CMS GC. When deploying on large machines, a lot of users use containers and run multiple nodes per host.
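
As a rough sketch of that approach, assuming the usual guidance of keeping the heap at or below ~50% of a node's RAM and under ~31 GB so compressed OOPs stay enabled (set via -Xms31g / -Xmx31g in config/jvm.options):

```python
# Back-of-the-envelope packing of Elasticsearch nodes onto a 128 GB host.
HOST_RAM_GB = 128
HEAP_GB = 31                     # stay under the compressed-OOPs threshold
RAM_PER_NODE_GB = HEAP_GB * 2    # leave at least as much again for the OS page cache

nodes_per_host = HOST_RAM_GB // RAM_PER_NODE_GB
print(nodes_per_host)            # 2 nodes per 128 GB host under these assumptions
```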
