Elasticsearch: ingesting a large volume of data

Can Elasticsearch ingest a large amount of data, at a rate of 300TB per year, and then archive that data to offline storage?
What challenges will I face when ingesting that much volume?

I would say that Elasticsearch should be able to handle this data rate, but I'd recommend looking into a few key concepts:

  1. 300TB is a lot of data, and keeping it all on one cluster would probably require a fairly large cluster. If possible, try to break the data into logical groupings and run an Elasticsearch cluster for each logical group. You can then use cross-cluster search so that a single "search" cluster searches all of your logical group clusters.
  2. At this scale, you'd most likely want dedicated ingest and coordinating nodes.
  3. If possible, look into using Elasticsearch's data tiering system. This allows greater storage density at lower cost.
  4. At this scale, you'll want to make sure your mappings and ingest pipelines are as efficient as possible (and thoroughly tested) to ensure reliable performance.
  5. Ensure you're on the latest version of Elasticsearch. Scalability has gotten significantly better in the last few versions (and is continuing to improve), and being able to take advantage of these improvements will probably make this far more pleasant.
  6. No matter how you look at it, you will need some powerful systems (CPU, RAM, disk, network) to support this kind of data rate. I would recommend doing some sort of semi-scale load testing to ensure you're able to handle it.
  7. Somewhat related to #6, ensure your backup solution can back up the data fast enough. During the backup process, some other things don't work (example: an index can't roll over during a backup if that index is included in the backup). Ensuring that your backups finish in a timely manner will reduce other potential issues that might come up as a result of them taking a while to run.
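To put the 300TB/year figure in perspective, here is a rough back-of-envelope sketch. It assumes decimal terabytes, a perfectly even ingest rate, and one replica per primary shard. None of these assumptions come from the question, and the on-disk index size usually differs from the raw source size, so treat the numbers as a starting point for load testing rather than a sizing guide.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365


def ingest_rates(tb_per_year: float, replicas: int = 1) -> dict:
    """Convert a yearly volume into average daily and per-second rates.

    Assumes 1 TB = 1e12 bytes and a constant ingest rate (real traffic
    is normally burstier, so peak rates will be higher).
    """
    bytes_per_year = tb_per_year * 1e12
    return {
        "gb_per_day": bytes_per_year / DAYS_PER_YEAR / 1e9,
        "mb_per_second": bytes_per_year / (DAYS_PER_YEAR * SECONDS_PER_DAY) / 1e6,
        # Each replica stores a full extra copy of the primary data.
        "stored_tb_per_year": tb_per_year * (1 + replicas),
    }


if __name__ == "__main__":
    for name, value in ingest_rates(300).items():
        print(f"{name}: {value:.2f}")
```

The average rate is modest; the harder parts are usually peak load, retention, and the total amount of data that has to stay searchable at once.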

I'm sure I'm missing some other considerations, but others here might be able to provide more insight.

As a theoretical example, using https://cloud.elastic.co/pricing, you can build a cluster (using all data tiers) that supports ~11.5PB of data in total. (Note that this is mainly just theory: a single cluster this large, while probably doable, probably isn't the best idea. See point #1.)
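For the data tiering mentioned in point 3, a lifecycle policy is the usual way to move indices through the tiers over time. The sketch below only builds the JSON body that would be sent to `PUT _ilm/policy/<policy-name>`. The phase timings and the rollover thresholds are made-up placeholders, not recommendations, and the delete phase should only be allowed to run once the data is safely archived.

```python
import json


def build_ilm_policy(warm_after: str = "30d",
                     cold_after: str = "90d",
                     delete_after: str = "365d") -> dict:
    """Return an ILM policy body with hot, warm, cold and delete phases.

    All ages and sizes are illustrative assumptions; tune them to your
    own retention and hardware.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index once a shard or the index gets too big/old.
                        "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    # Merge segments once the index is no longer written to.
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "cold": {
                    "min_age": cold_after,
                    # Recover these indices last after a restart.
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```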

Thank you for sharing. Regarding this point:

If I need to achieve this capability internally, I will provide powerful systems (CPU, RAM, disk, network) and will take index and search speed tuning into consideration.

In your opinion, what are the differences between cross-cluster searching and single-cluster searching? Do you mean that cross-cluster searching will provide higher performance for searching and indexing speed?
what will the provide for the logical group clusters?

In your opinion, what are the differences between cross-cluster searching and single-cluster searching? Do you mean that cross-cluster searching will provide higher performance for searching and indexing speed?

There are a few pros and cons to each architecture and not all are strictly performance related.

Cross-cluster Search

Pros:

  • Smaller clusters
    • When Elasticsearch makes a change, it propagates that change across all nodes. The more nodes in a cluster, the longer it takes for that change to propagate, which slows things down.
      • Note: I recall there being a post somewhere about this, but couldn't find the link. If this info is incorrect, or if someone has the link, please let me know.
    • Shorter upgrade durations.
      • On average you can assume that it takes ~15-20 minutes to upgrade an Elasticsearch node. With this in mind, the more nodes in a cluster, the longer it will take to perform a full (rolling) upgrade. Smaller clusters allow you to upgrade one cluster at a time in a shorter window.
    • Upgrade validation
      • While you should always be testing new versions of Elasticsearch in some sort of test environment, no environment will ever truly match production. By being able to upgrade only part of your production infrastructure at a time, you have a better chance of detecting issues and "fixing" them, without impacting your entire production environment.
  • Targeted performance
    • If you know what data lives on a specific cluster in a cross-cluster search setup, you can more specifically target the resources of that cluster to obtain the desired performance for that data.
  • Requires you to understand your data better
    • To be able to set up a viable cross-cluster search architecture, you must understand your data well enough to know what it is and how it will be used. Without this knowledge you probably wouldn't have a great cross-cluster search architecture.
    • There is a secondary advantage here: if you understand your data better, you can get improvements elsewhere, such as in index mappings.
  • Scaling
    • While I'm pretty sure a single Elasticsearch cluster can now scale to petabytes of data and tens of thousands of indices, there is probably some eventual limitation out there. When using cross-cluster search this limitation effectively goes away, as you can probably have a large number of clusters as part of the cross-cluster search setup.
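The upgrade-duration point can be made concrete with a little arithmetic. This sketch uses the ~15-20 minutes per node estimate from above and assumes nodes are upgraded strictly one at a time; the node counts in the example are hypothetical.

```python
def rolling_upgrade_hours(nodes: int, minutes_per_node=(15, 20)) -> tuple:
    """Estimate (low, high) hours for a rolling upgrade of one cluster.

    Assumes nodes are upgraded sequentially, one at a time.
    """
    low, high = minutes_per_node
    return nodes * low / 60, nodes * high / 60


if __name__ == "__main__":
    # Hypothetical comparison: one 60-node cluster vs. one of four 15-node clusters.
    print("single 60-node cluster:", rolling_upgrade_hours(60))
    print("one 15-node cluster:   ", rolling_upgrade_hours(15))
```

The total work across all four smaller clusters is the same, but each upgrade window is shorter and a problem found in the first cluster doesn't affect the other three.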

Cons:

  • More clusters
    • With cross-cluster search, you need to deal with multiple clusters, which generally adds "management" overhead in the form of additional configuration, upgrading, and monitoring for each cluster. This adds to the overall "total cost of ownership" and to the list of things to worry about.
    • More complex architecture: similar to the above point, cross-cluster search requires a more complex architectural setup than its alternative, a single cluster.
  • Requires you to understand your data
    • While this is a pro, it could also be considered a con. If you currently don't understand your data and how it will be used, trying to gain this knowledge to be able to properly implement cross-cluster search can take a significant amount of time if your data is large and covers a wide range of "areas".
  • Feature parity
    • There are some features currently in Elasticsearch that don't support/work well with cross-cluster search. It's important to know what features you want to use and validate that they work with it.
      • Note: Elasticsearch has been making recent enhancements and many more features now support cross-cluster search.

Single Elasticsearch Cluster

Pros

  • Relatively simple architecture

    • While this cluster could grow quite large, the overall architecture for a single cluster is simpler than a cross-cluster setup
  • Don't really need to fully understand your data

    • Because you're just dumping everything into a single cluster, you don't truly need to understand your data.
  • Feature support

    • All Elasticsearch features work against a single Elasticsearch cluster
  • Lower total cost of ownership

    • With a single cluster, you only need to worry about one thing, so there is overall less "management" overhead

Cons

  • Relatively Limited scaling

    • When compared to cross-cluster search, a single cluster can't scale as large as multiple clusters
  • Un-targetable performance

    • When using a single cluster, your data is mainly mixed together, meaning that if you have some data you want to perform better than other data, you need to scale the whole cluster
      • Note: I recognize that there are in theory ways to do this, however, they're relatively advanced topics, and in practice a cross-cluster search setup is probably "simpler"
  • Long upgrades

    • As mentioned, it takes ~15-20 minutes to upgrade an Elasticsearch node, so the more nodes you have, the longer it takes.
  • Potentially lower performance

    • As an Elasticsearch cluster grows in number of nodes, more work and overhead are needed just to keep it operating, which has the potential to slow down the cluster in unexpected ways.

Note: This mainly compares cross-cluster search with a single cluster, but in theory you can start with one of the architectures and transition to the other if your use case changes. (It will require additional work to rearchitect, though.)

what will the provide for the logical group clusters?

I don't really understand this question.

I mean What will provide the logical group clusters?
What additional benefit will the logical group clusters give to improve performance?

What will provide the logical group clusters?

This requires you to know and understand your data and how it can be segmented into "logical" groups that could become their own Elasticsearch cluster as part of a cross-cluster search architecture.
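As a small illustration of how logical groups map onto a cross-cluster search, remote indices are addressed as `cluster_alias:index_pattern`, and several of them can be combined with commas in one search request. The group names and index patterns below are invented for the example; the aliases would have to match remote clusters that you have actually configured on the "search" cluster.

```python
def ccs_search_target(groups: dict) -> str:
    """Build a cross-cluster search index expression.

    `groups` maps a remote cluster alias to the index patterns that
    live on that cluster (all names here are hypothetical).
    """
    parts = []
    for alias, patterns in groups.items():
        for pattern in patterns:
            parts.append(f"{alias}:{pattern}")
    return ",".join(parts)


if __name__ == "__main__":
    groups = {
        "security": ["logs-*"],
        "application": ["logs-*", "metrics-*"],
    }
    target = ccs_search_target(groups)
    # The expression goes where a normal index name would go in the search URL.
    print(f"GET /{target}/_search")
```

Narrowing the dictionary to a single group means only that cluster does the work, which is the "targeted performance" advantage described earlier.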

What additional benefit will the logical group clusters give to improve performance?

See the pros of Cross-cluster Search in my last response. (Not everything is about raw performance at larger scales.)
