ES Single Index Scalability Limited to 3TB of RAM?

I am curious whether any ES users have scaled their systems to reliably handle several terabytes of data or beyond. My back-of-the-napkin calculations below suggest that an ES system is limited to about 3TB of RAM, which makes scaling much beyond that amount of data/index difficult or impossible.

Edit: To clarify, I mean that scaling a single index, without using routing, beyond 3TB of RAM may not be possible. This puts a practical limit on the maximum index size.

In particular:

  • Shards: Adding more shards allows you to spread the work across more nodes. However, too many nodes increase network usage and latency. Each application is different, but the documentation makes clear that 1000 shards is far too many. I have not tested it, but I get the impression that you start hitting serious performance issues at 50+ shards. (A sketch of how the shard and replica counts are set follows this list.)
  • Memory: You can increase memory on each node up to 31GB, beyond which the JVM loses compressed object pointers and you pay a RAM penalty for 64-bit pointers. However, there is another price for breaking the 31GB barrier: longer garbage collections. On my 31GB machines, an old-generation collection of a ~30GB heap easily takes 5 to 30 seconds. I imagine it could take minutes to collect 100GB, which is not acceptable. Therefore, you are effectively limited to roughly 31GB anyway.
  • Routing: Setting up routing improves performance, but it also limits which searches can be run quickly. My understanding is that queries spanning many routes see no improvement. But more to the point, requiring users to logically divide their data is the opposite of scalability.
  • Replicas: Adding replicas artificially increases the total RAM in the system and may improve query throughput (though it could hurt because of increased network usage), but adding replicas does not have a substantial impact on scalability.
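To make those knobs concrete, here is a minimal sketch using the Python client (elasticsearch-py); the host, index name, and settings values are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# number_of_shards is fixed at index creation time;
# number_of_replicas can be changed later on a live index.
es.indices.create(index="books", body={
    "settings": {
        "number_of_shards": 50,     # spread across up to 50 data nodes
        "number_of_replicas": 1     # one extra copy of every shard
    }
})

# Dial replicas up (or down) without reindexing:
es.indices.put_settings(index="books", body={
    "index": {"number_of_replicas": 2}
})
```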

Putting this together, assuming each 31GB node holds one shard and 100 shards is the practical limit, an ES system is limited to about 3TB of RAM (31GB × 100 shards ÷ 1024 ≈ 3TB). A system with 3TB of RAM may be able to handle 3TB of data, depending on the index structure. However, it would probably need to avoid advanced ES features, and scaling significantly beyond 3TB is harder to imagine.

If I am missing something here, I would really like to hear about it. I would also like to hear of experiences approaching several TB of data. Do you ditch the advanced ES functionality? Do you live with slower-than-snappy query times? Is logically dividing data through routes or separate indices mandatory? Or are some of my assumptions above incorrect?


Nope.

I've seen a 96GB heap take 30 min, so yeah.

What makes you say that?

Huh?

Depends on your requirements. I have seen users with an N:N node:replica count and that works great for them.

This is where your assumptions fall apart unfortunately :frowning: But don't despair! :smiley:

We have customers doing PB scale - https://www.elastic.co/elasticon/conf/2016/sf/tapping-out-security-threats-at-fireeye, https://www.elastic.co/elasticon/conf/2016/sf/scaling-elasticsearch-at-netsuite-good-bad-and-ugly - and with that comes (tens of) thousands of shards across thousands of indices.


But how many PB scale systems do you see with only a single index? Without any routing?

Dividing the data up into separate indices or routes defeats the purpose of a scalable system, because it requires that you logically divide the data. You have to sit back and think about where you want your data to go. For certain types of data this is relatively easy (e.g., logs can be divided by time, customer databases can be separate, etc.). But what if your data is not so easily divided up (e.g., Google Book Search, DNA sequencing)? By splitting data into indices or along routes, you are manually dividing the dataset into smaller datasets. That's not scaling. That's being a librarian categorizing datasets. Ideally ES could grow to petabytes without manually restructuring the data.

If your query hits all of your routes, you get no performance speedup. It only helps if you can craft queries where you know all of the results will be on a single route.

Going from 0 to 1 replicas (and doubling the number of nodes) will double the amount of RAM in the system, but it won't double the performance or the number of documents.

This is wrong. Production here runs ~1000 shards on a three-node cluster, with each node having 16 GB RAM.

The JVM provides the G1 garbage collector, which delivers reliable, predictable GC pauses on heap sizes over 16 GB. Besides that, there is not much advantage to having a heap larger than 31 GB per JVM. It's a myth that just adding more heap will always scale. A 16 GB heap is more than enough for a JVM on today's multi-core CPUs - and that is not specific to Elasticsearch.

Collecting 100 GB of heap with low, predictable pauses will become possible with the Shenandoah garbage collector. That is still a future OpenJDK project: Shenandoah
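As an aside, you can check from the client side what heap each node actually has and (on ES 2.2+) whether it still benefits from compressed object pointers; a rough sketch with the Python client, host made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

info = es.nodes.info()
for node_id, node in info["nodes"].items():
    heap_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / (1024.0 ** 3)
    # Reported by ES 2.2+; older versions simply won't have this field.
    oops = node["jvm"].get("using_compressed_ordinary_object_pointers", "unknown")
    print("%s: heap %.1f GB, compressed oops: %s" % (node["name"], heap_gb, oops))
```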

The assumption that routing does "improve performance" is wrong. It is an effective organization for indexing/search, and it saves a lot of redundant network and CPU cycles across the cluster. But request processing and responses are not faster from the client's perspective. So, routing improves scalability.
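For illustration, a minimal sketch of what routing looks like from the Python client (the index, type, and routing key are made up); the point is that both the indexing and the search touch only the one shard the routing value hashes to:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# All documents sharing routing="customer-42" land on the same shard.
es.index(index="orders", doc_type="order", id="1",
         body={"customer": "customer-42", "total": 99.0},
         routing="customer-42")

# A search with the same routing value goes to that one shard only,
# instead of being fanned out to every shard in the index.
res = es.search(index="orders", routing="customer-42",
                body={"query": {"term": {"customer": "customer-42"}}})
print(res["hits"]["total"])
```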

Here, the assumption that replica levels improve performance is also wrong. Replica levels improve reliability. In the case of search operations, replica levels contribute to improved throughput, that is, a cluster can process more search requests per unit of time. As always, there is a tradeoff: the cost of improved throughput is much higher resource consumption.

100 shards per cluster are not the practical limit.

You should study more about horizontal and vertical scaling. To get performance out of Elasticsearch, you don't need to rely on vertical scaling. Horizontal scaling works best.

There is no practical limit to scaling an ES cluster horizontally; you can throw hardware at it and it scales. If you hit a limit, then you

  • are either doing things terribly wrong
  • or you are too rich and have added too much hardware to the cluster

Using my three nodes in the test cluster, I could ramp up more than 10k shards - but that is not what I'm after. The design decisions should always consider the complexity of the field mappings and the indexing/search workload. Besides, shard count and heap size are not proportional; there are many different factors that determine effective heap use. Think of shards as "growing/shrinking" entities: they merge segments, they receive massive bulk requests, etc. After exceeding a certain size, handling them becomes too expensive (recovery, segment merges, snapshot/restore, moving from one node to another). The exact critical shard size can be determined by running tests to measure it; for small hardware systems it's around 1-10 GB, for large systems it may reach 50 GB.
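If it helps, one cheap way to keep an eye on shard sizes while running such tests is the cat shards API; a small sketch with the Python client (host and column selection are just examples):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# One line per shard: index, shard number, primary/replica, on-disk size, node.
print(es.cat.shards(h="index,shard,prirep,store,node", v=True))
```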

In summary, my three production nodes are running with 3 x 2.8 TB disk space and 3 x 16 GB heap (on 64 GB RAM each). I'm driving ~100 indices with ~1000 shards, occupying 3 x 350 GB disk space, and I can expect dynamic growth by a factor of 2-4 on these machines. After that, I can just add new hardware.

From the Elastic company's perspective, such a cluster is considered a very small one.

Hi @jprante,

Appreciate the response, but I think there is a misunderstanding. I'm not trying to increase shards for the sake of it (like putting 100 shards on a single node). I am looking to increase shards solely to increase the total amount of data in the cluster by spreading the shards across more nodes. My point is that there is a limit to the size of the individual machines (31GB RAM) and to the number of feasible nodes for one index without routing (~100 nodes with 1 shard per node). That caps the amount of RAM available to the index, which in turn limits the amount of data you can work with.

As you mentioned, your cluster with 3 nodes and 16GB of RAM is rather small. You have plenty of room to scale. In my system, I have about 800GB of RAM devoted to a single index and even though I still have room to grow, I'm starting to see the ceiling.

The main question is: how do you scale a single index, which does not use routing, to the multi-terabyte or petabyte scale? Increasing node size stops working after a point, and so does increasing the number of nodes. I haven't seen a satisfactory answer.

The only limit is the ~2 billion docs per shard limit that we inherit from Lucene.
There are other practical limits, but this 100 shard one is just not right.

The comments leading up to this are generating a strawman.

Unless there is some magical system someone has sitting in a closet they aren't sharing, every system needs logical curation or categorisation. RDBMSs use tables, ES/mongo/cassandra use indices/tables, hadoop uses files.
Books can be by author surname, DNA can be by, well I dunno, but there would be something.

I've added emphasis above because you didn't mention that in your original comments, hence my question. But:
Adding N replicas increases the document count by docs*N.
Adding replicas does not require you to add more nodes.

Doing a query on multiple routes only targets the shards the routes apply to. Doing a query on all routes will be slower than a single one, yes, so don't do that, it defeats the purpose of routing.
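For what it's worth, a query can also be pinned to a handful of routes at once by passing a comma-separated routing value; only the shards those values hash to are searched. A sketch (names made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# Only the shards that "alice" and "bob" hash to are queried,
# not every shard in the index.
res = es.search(index="orders", routing="alice,bob",
                body={"query": {"match_all": {}}})
```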

Ultimately you will not find a golden bullet for your problems; you need to make compromises with whatever platform you use.

[quote="warkolm, post:6, topic:61431"]
There are other practical limits, but this 100 shard one is just not right.[/quote]
I have not tested it, but I am pretty skeptical that a single index split among 100 nodes with 1 shard each (100 shards total) works well. In a single index, every time you make a query, you have to distribute that query to all 100 nodes on the network. The results will be sent back to the client node, which then has to put together 100 result sets. I would be very surprised if this worked well.

[quote="warkolm, post:6, topic:61431"]
every system needs logical curation or categorisation. ... mongo/cassandra use indices/tables, hadoop uses files.[/quote]
You shouldn't have to split up your mongo tables because they get too big. You just have to add more hardware. Similarly, you should not have to split up your ES indices because they get too big. You should simply be able to add more hardware. However, for ES, I suspect the limit is around 100 nodes with 31GB in each node.

Yes, looking back now, my original point was confusing. I was assuming here that we were already pushing the hardware to the limit, so adding replicas would require you to add nodes or make your existing nodes bigger.

Why would you be querying something that goes across the entire dataset when you are using routing?

You obviously have a whole bunch of assumptions and requirements you aren't sharing, and the questions you are asking are making things more confusing.

My question is only this: how big of an index can Elasticsearch handle before you have to start manually partitioning it (e.g., using routing, splitting up indices, etc.)? That was the premise from the beginning, but apologies if that did not come across clearly. My hypothesis is that the size of the index will be limited by the roughly 3TB of RAM that the cluster will be able to devote to it.

That's ok, it's all a journey :slight_smile:

I don't know the answer to this and our automated performance tests don't cover this sort of use case. I'll see if anyone else has comments though!

Interesting conversation so far. The biggest "single index" search use case I've seen had around 9 TB of RAM behind it, and it ran into SAN parallel-storage IOPS limitations before it ran into scalability limits merging search results across all the shards (way over 100).

@Michael_Sander if you are working within the real-world constraints of real hardware and the math involved in distributed systems (Amdahl's law), and you don't have something in the data model or queries that lends itself to partitioning of the data, then every software system, no matter what it might run on, is going to run into a scalability limit. AKA the point where adding more horizontal hardware doesn't actually make anything faster or able to handle more throughput.

"That's being a librarian categorizing datasets" is an interesting comment to throw into the mix. In fact, librarians built the first search indices to solve scalability problems with practical compromises. They did this thousands of years ago. I know this is frustrating, and I bet the Babylonian (or whatever) that invented the first search index felt equally frustrated when he/she was forced to categorize some scrolls or tablets or whatever that seemed to defy categorization.

Picking and broadcasting a number for ES's current limit is a mistake, though. It is determined by the nature of your hardware, networks, performance expectations, indexing characteristics, data model, and queries. I encourage you to test what your real bottlenecks are and to try to validate some of your skepticism with empirical testing.

Hope that helps

I'll definitely be doing some testing, but I was hoping someone here had already done so.

Your example is exactly what I was hoping someone would point out. That is a very large index. Do you recall if this index used routing (a form of manual partitioning)? Also, I am curious as to the type of nodes they used and the number. Did they use 1 shard per node? Finally, were there replicas? If so, that could "artificially" increase the perceived RAM in the system.

Yes, librarians break datasets down into categories allowing you to search each category independently; this was scaling before there was the word scaling. This job is not too different from breaking a big index down into smaller indices, or by partitioning a single index with routing.

The problem with this approach however, is that your ability to scale now becomes proportional to the number of librarians you have on staff. If you want to grow your dataset by X%, you'll need to hire an additional k*X% people to re-categorize and distribute the data. That fundamentally changes the cost structure of storing data. Rather than the cost being proportional to the cost of computing systems, it is now proportional to human capital. Imagine if Google had to hire 10% more people every time their index increased 10%, rather than just buy 10% more servers.

There are single-index systems out there that, in theory, can horizontally scale indefinitely without manual partitioning (e.g., Bigtable, Mongo). That said, these systems don't support many of the nice features that Elasticsearch does (they rely on eventual consistency and don't use joins). I don't expect a single index on Elasticsearch to scale indefinitely, but I was hoping the limit would be much higher than 3TB of RAM.

My healthy-skepticism best guess is that the practical limit for a "single index" is higher than 9TB of RAM. Eventually shard counts get high enough that network saturation and merge-time-induced request latency become the limit. I look forward to seeing my first hundreds-of-PB use case, but I expect the limiting factor will have more to do with hardware budget than with ES.

Mongo works within the same constraints. Read up on Amdahl's law. There is no magic here. I can't comment on BigTable. Neither are search engines though, so the assumptions and expectations will be different.

The 9TB-of-RAM system was a single index with no manual partitioning. Multiple shards per box. Replication level 1 for HA. Replication doesn't artificially increase the perceived RAM of the system; it's the same RAM now handling more shards. For a full scatter-gather, half the shards are ignored, so it isn't any different from replication level 0 until you start talking about concurrent query execution. This one had something like 150 data nodes x 64 GB of physical RAM. They can go higher if they find more hardware money, so they haven't hit an ES limit yet.

The larger clusters that I've seen use time-based indices (partitioning). The queries span multiple indices, and at that point of query execution they are just a collection of shards (indices are just a logical concept in ES). After the index patterns are resolved, querying over a large number of shards is basically identical to your "single index" setup from a read perspective.
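As an illustration, a query over time-based indices is just a name pattern on the search; a sketch with the Python client and made-up daily index names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# After the pattern is resolved, this is a scatter-gather over the matching
# shards, exactly as it would be for one big index.
res = es.search(index="logs-2016.03.*",
                body={"query": {"match": {"message": "timeout"}}})
print(res["hits"]["total"])
```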

Divide and conquer


FWIW.

I have an index that I am already partitioning. It's a daily index. In several data centers, I am hitting over 10TB. Two DCs consistently go over 14TB. So I have to partition by DC and use a tribe node to federate them.

These clusters are using 30 to 33 nodes respectively. I am in the process of adding more nodes to each cluster.

This is good. But to be clear, a cluster with 9TB of RAM, on nodes with 31GB of RAM each, means that you would have about 300 nodes in your cluster. Your system was rep 1, so every time a query is made, the client would need to send that query to 150 nodes on average, and then the client would have to merge all the results. That's certainly possible, and I'm glad others are optimistic, but I don't think my skepticism is misplaced.

The largest-scale databases, Mongo and Bigtable, rely on eventual consistency. This allows them to grow much larger than a typical DB, and Amdahl's law doesn't really apply because operations become constant time (not proportional, or even log-N proportional, to the size of the data). But yes, these NoSQL databases are very different creatures from search engines. I only raised them in response to the point that "every system needs logical curation," and they provide a counterexample.

I knew this point was going to be pounced on, but I stand by it. If you go from rep 0 to rep 1, you'll need to double the number of shards in your cluster. If your nodes are already running at reasonably high utilization, doubling the number of shards will lead to overutilization and instability. You'll need to double the number of nodes in the cluster, which leads to doubling the RAM of the cluster. So going from rep 0 to rep 1 can very often lead to a doubling of RAM in the cluster, but without increasing the total data capacity.

Similarly, on your 9TB RAM cluster with rep 1, a bunch of RAM (roughly 4.5TB) is devoted to replications. Thus, while the total RAM of the cluster is 9TB, the amount of RAM devoted to unique data is closer to 4.5TB.

One final point. This 9TB system was made up of 150 nodes, each with 64GB of RAM, but presumably only 31GB was devoted to ES (the rest was for disk cache). Further, assuming you buy my argument regarding the rep 1 artificially doubling the RAM of the cluster, then this cluster really only had 2.25TB of RAM devoted to ES (divide by 2 for disk cache, divide by 2 for rep 1). This cluster still falls under my original hypothesis of 3TB.

@Michael_Sander Sorry if you feel your points are being pounced on. Not my intention.

Ha, not at all. "[A]rtificially increas[ing] perceived RAM" was a vague statement, I set myself up for it.

This is a very complicated topic but we have customers with systems of the scale you're talking about. The largest single index on a single cluster we've seen so far was about 90TB total storage or about 30TB excluding replicas. That was 600 total shards spread across 96 nodes. We did eventually break that up for various reasons that had nothing to do with search (even though there wasn't really a natural way to partition it) but that scale isn't that hard for ES to manage.

With ES you really should just be prepared to partition: starting with shards, then indices, and at really large scale, clusters. Embracing that idea is key to building a manageable system that scales, but it does require some work. Note: the keyword there is "manageable". The largest we have now is about 1PB spread across 8 clusters tied together with tribe nodes. That's 660 nodes and about 20TB of heap.

Just remember that even though you're doing physical partitioning of the data ES has great features to continue to provide a single logical view of the data.
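A concrete example of that single logical view is an index alias that spans the physical partitions; a sketch with the Python client (index and alias names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# Group several physically separate indices under one alias...
es.indices.update_aliases(body={"actions": [
    {"add": {"index": "events-part1", "alias": "events"}},
    {"add": {"index": "events-part2", "alias": "events"}},
]})

# ...and query them as if they were a single index.
res = es.search(index="events", body={"query": {"match_all": {}}})
```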

BTW, on the subject of routing... be very, very careful with that. The number one thing about keeping ES happy at scale is BALANCE. Routing if done incorrectly can easily get things way out of balance and will lead you to all kinds of nasty operational issues that are a real pain to resolve.
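One cheap way to watch that balance is the cat allocation API; a sketch:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

# Shard count and disk usage per node; badly skewed routing shows up here quickly.
print(es.cat.allocation(v=True))
```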

Extra BTW, No system scales indefinitely without partitioning of some form. If you think Mongo can do this then there's no way you've ever actually worked with it. Every distributed data storage system has limitations of some sort in this area, it's just a question of exactly how they manifest. Even Hadoop starts to hit limits as you get into thousands of nodes in a single cluster.

Kimbro
