Ingesting large amounts of data into Elasticsearch

m3bgwad · December 28, 2022, 1:04pm

Is Elasticsearch capable of ingesting a large amount of data, at a rate of 300TB per year, and then archiving that data to offline storage?
What challenges will I face when ingesting that much volume?

BenB196 · December 28, 2022, 3:10pm

I would say that Elasticsearch should be able to handle this data rate, but I'd recommend looking into a few key concepts:

1. 300TB is a lot of data; having it all on one cluster would probably require a fairly large cluster. If possible, try to break the data into logical groupings and have an Elasticsearch cluster for each logical group. You can then use cross-cluster search to have a single "search" cluster that searches all of your logical group clusters.
2. At this scale, you'd most likely want dedicated ingest and coordinating nodes.
3. If possible, look into using Elasticsearch's data tiering system. This will allow for greater data storage density at lower cost.
4. At this scale, you'll want to make sure your mappings and ingest pipelines are as efficient as possible (and thoroughly tested) to ensure reliable performance.
5. Ensure you're on the latest version of Elasticsearch. Scalability has gotten significantly better in the last few versions (and is continuing to improve); being able to take advantage of these improvements will probably make this far more pleasant.
6. No matter how you look at it, you will need some powerful systems (CPU, RAM, disk, network) to support this kind of data rate. I would recommend doing some sort of semi-scale load testing to ensure you're able to handle it.
7. Somewhat related to #6, ensure your backup solution is able to back up the data fast enough. During the backup process, some other things don't work (example: an index can't roll over during a backup if that index is included in the backup). Ensuring that your backups finish in a timely manner will reduce other potential issues that might come up as a result of them taking a while to run.
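
To put the 300TB per year figure into perspective, here is a rough back-of-envelope calculation. Treat it as a sketch only: apart from the 300TB figure, every number is an assumption (decimal terabytes, perfectly even ingest across the year, and a 50GB target primary shard size), and the on-disk size after indexing can differ noticeably from the raw size.

```python
import math

# Back-of-envelope ingest sizing. Only the 300 TB/year figure comes from the
# question; everything else here is an assumption for illustration.
BYTES_PER_TB = 1000**4  # decimal terabyte


def ingest_rates(tb_per_year: float, days_per_year: int = 365) -> dict:
    """Convert a yearly volume into average daily and per-second rates."""
    bytes_per_day = tb_per_year * BYTES_PER_TB / days_per_year
    return {
        "gb_per_day": bytes_per_day / 1000**3,
        "mb_per_second": bytes_per_day / 86_400 / 1000**2,
    }


def primary_shards_per_day(gb_per_day: float, target_shard_gb: float = 50.0) -> int:
    """Primary shards filled per day, assuming a target shard size (an assumption)."""
    return math.ceil(gb_per_day / target_shard_gb)


rates = ingest_rates(300)
shards = primary_shards_per_day(rates["gb_per_day"])
print(f"{rates['gb_per_day']:.0f} GB/day, {rates['mb_per_second']:.1f} MB/s, "
      f"~{shards} primary shards/day")
```

That works out to an average of roughly 822GB per day, or about 9.5MB per second, before replicas. Real traffic is rarely flat, so plan load tests around peak rates rather than this average.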

I'm sure I'm missing some other considerations, but others here might be able to provide more insight.

As a theoretical example, using https://cloud.elastic.co/pricing you can build a cluster (using all data tiers) that supports ~11.5PB of data in total. (Note that this is mainly theoretical: a single cluster this large, while probably doable, probably isn't the best idea; see point #1.)
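
For the data tiering point, a minimal sketch of what an index lifecycle management (ILM) policy body spanning the tiers might look like is below. The phase timings, the 50GB rollover size, and the repository name `my-archive-repo` are all assumptions for illustration, not recommendations. The snippet only builds and prints the JSON body; it would be sent with `PUT _ilm/policy/<policy-name>`. The frozen phase relies on searchable snapshots, so check that your license and snapshot repository support that before relying on it.

```python
import json


def build_tiered_policy(snapshot_repo: str) -> dict:
    """Build an ILM policy body that moves data through hot/warm/cold/frozen.

    All ages and sizes below are placeholder assumptions.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over to a new index when a shard gets large or old.
                        "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                    }
                },
                "warm": {
                    "min_age": "7d",
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "cold": {
                    "min_age": "30d",
                    "actions": {"set_priority": {"priority": 0}},
                },
                "frozen": {
                    "min_age": "90d",
                    "actions": {
                        "searchable_snapshot": {"snapshot_repository": snapshot_repo}
                    },
                },
                # Remove from the cluster after a year; this assumes the data has
                # already been snapshotted or archived elsewhere.
                "delete": {"min_age": "365d", "actions": {"delete": {}}},
            }
        }
    }


policy = build_tiered_policy("my-archive-repo")  # hypothetical repository name
print(json.dumps(policy, indent=2))
```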

m3bgwad · December 29, 2022, 7:45am

Thank you for sharing. Regarding this point:

If I need to achieve this capability in-house, I will provide powerful systems (CPU, RAM, disk, network) and take index and search speed tuning into consideration.

In your opinion, what are the differences between cross-cluster search and single-cluster search? Do you mean that cross-cluster search will provide higher search and indexing performance?
what will the provide for the logical group clusters?

BenB196 · December 29, 2022, 12:59pm

In your opinion, what are the differences between cross-cluster search and single-cluster search? Do you mean that cross-cluster search will provide higher search and indexing performance?

There are a few pros and cons to each architecture and not all are strictly performance related.

Cross-cluster Search

Pros:

Smaller clusters
- When Elasticsearch makes a change, it propagates that change across all nodes. The more nodes in a cluster, the longer it takes for that change to propagate, which slows things down.
  - Note: I recall there being a post somewhere about this, but couldn't find the link. If this info is incorrect, or if someone has the link, please let me know.
- Shorter upgrade durations.
  - On average you can assume that it takes ~15-20 minutes to upgrade an Elasticsearch node. With this in mind, the more nodes in a cluster, the longer it will take to perform a full (rolling) upgrade. Smaller clusters allow for upgrading one cluster at a time, each in a shorter interval.
- Upgrade validation
  - While you should always be testing new versions of Elasticsearch in some sort of test environment, no environment will ever truly match production. By being able to upgrade only part of your production infrastructure at a time, you have a better chance of detecting issues and "fixing" them, without impacting your entire production environment.
Targeted performance
- If you know what data lives on a specific cluster in a cross-cluster search setup, you can more specifically target the resources of that cluster to obtain the desired performance for that data.
Requires you to understand your data better
- To be able to set up a viable cross-cluster search architecture, you must understand your data well enough to know what it is and how it will be used. Without this knowledge you probably wouldn't have a great cross-cluster search architecture.
- There is a secondary advantage here, if you understand your data better, you can get improvements elsewhere like index mappings.
Scaling
- While I'm pretty sure a single Elasticsearch cluster can now scale to petabytes of data and tens of thousands of indices, there is probably some eventual limitation out there. With cross-cluster search this limitation effectively goes away, as you can probably have a large number of clusters as part of the cross-cluster search setup.

Cons:

More clusters
- With cross-cluster search, you need to deal with multiple clusters, which generally adds "management" overhead in the form of additional configuration, upgrading, and monitoring for each cluster. This adds to the overall "total cost of ownership" and to the list of things to worry about.
- More complex architecture: similar to the above point, cross-cluster search requires a more complex architectural setup than its alternative, a single cluster.
Requires you to understand your data
- While this is a pro, it could also be considered a con. If you currently don't understand your data and how it will be used, trying to gain this knowledge to be able to properly implement cross-cluster search can take a significant amount of time if your data is large and covers a wide range of "areas".
Feature parity
- There are some features currently in Elasticsearch that don't support/work well with cross-cluster search. It's important to know what features you want to use and validate that they work with it.
  - Note: Elasticsearch has been making recent enhancements and many more features now support cross-cluster search.
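
To illustrate what searching across the logical group clusters looks like, here is a minimal sketch. It relies on the cross-cluster search convention of prefixing an index pattern with a remote cluster alias (`alias:index`). The aliases `logs_cluster` and `metrics_cluster` and the pattern `events-*` are hypothetical, and the remote clusters would already need to be registered on the "search" cluster. The code only builds the request path; it does not contact any cluster.

```python
from typing import Iterable


def ccs_index_expression(cluster_aliases: Iterable[str], index_pattern: str) -> str:
    """Build a cross-cluster index expression such as 'a:idx-*,b:idx-*'."""
    aliases = list(cluster_aliases)
    if not aliases:
        raise ValueError("at least one remote cluster alias is required")
    return ",".join(f"{alias}:{index_pattern}" for alias in aliases)


def ccs_search_path(cluster_aliases: Iterable[str], index_pattern: str) -> str:
    """Build the _search path that targets every listed remote cluster."""
    return f"/{ccs_index_expression(cluster_aliases, index_pattern)}/_search"


# Hypothetical aliases, one per logical group cluster.
path = ccs_search_path(["logs_cluster", "metrics_cluster"], "events-*")
print(path)
```

Leaving an alias out of the list is how you would target only the cluster that holds a given group of data, which is the "targeted performance" idea from the pros above.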

Single Elasticsearch Cluster

Pros:

Relatively simple architecture
- While this cluster could grow quite large, the overall architecture of a single cluster is simpler than a cross-cluster setup.
Don't really need to fully understand your data
- Because you're just dumping everything into a single cluster, you don't truly need to understand your data.
Feature support
- All Elasticsearch features work against a single Elasticsearch cluster
Lower total cost of ownership
- With a single cluster, you only need to worry about one thing, so there is overall less "management" overhead

Cons:

Relatively Limited scaling
- When compared to cross-cluster search, a single cluster can't scale as large as multiple clusters
Un-targetable performance
- When using a single cluster, your data is mainly mixed together, meaning that if you have some data you want to perform better than other data, you need to scale the whole cluster.
  - Note: I recognize that there are in theory ways to do this, however, they're relatively advanced topics, and in practice a cross-cluster search setup is probably "simpler"
Long upgrades
- As mentioned, it takes ~15-20 minutes to upgrade an Elasticsearch node, the more nodes you have, the longer it takes.
Potentially lower performance
- As an Elasticsearch cluster grows in number of nodes, more work and overhead is added just to operate the cluster, which has the potential to slow it down in unexpected ways.
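
The upgrade-duration point can be made concrete with some simple arithmetic. This is only a sketch: it assumes nodes are upgraded strictly one at a time, uses the ~15-20 minutes per node estimate from above, and the node counts (60 nodes, or three clusters of 20) are hypothetical.

```python
from typing import Tuple


def rolling_upgrade_hours(
    node_count: int, minutes_per_node: Tuple[int, int] = (15, 20)
) -> Tuple[float, float]:
    """Return (low, high) estimated hours for a one-node-at-a-time rolling upgrade."""
    low, high = minutes_per_node
    return node_count * low / 60, node_count * high / 60


# One hypothetical 60-node cluster versus one of three 20-node clusters.
single = rolling_upgrade_hours(60)
per_small_cluster = rolling_upgrade_hours(20)
print(f"single cluster: {single[0]:.1f}-{single[1]:.1f} hours")
print(f"each smaller cluster: {per_small_cluster[0]:.1f}-{per_small_cluster[1]:.1f} hours")
```

The total node-hours are the same either way; the difference is that each smaller cluster finishes sooner and can be validated before the next one is touched.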

Note: While this mainly compares cross-cluster search and a single cluster, it is important to note that in theory you can start with one of the architectures and transition to the other if your use case changes. (It will require additional work to rearchitect, though.)

what will the provide for the logical group clusters?

I don't really understand this question.

m3bgwad · December 29, 2022, 1:21pm

I mean: what will provide the logical group clusters?
What additional benefit will the logical group clusters give to improve performance?

BenB196 · December 29, 2022, 1:24pm

What will provide the logical group clusters?

This requires you to know and understand your data and how it can be segmented into "logical" groups that could become their own Elasticsearch cluster as part of a cross-cluster search architecture.

What additional benefit will the logical group clusters give to improve performance?

See the pros of Cross-cluster Search in my last response. (Not everything is about raw performance at larger scales.)

system · January 26, 2023, 1:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
