ES Single Index Scalability Limited to 3TB of RAM?

Okay, 90TB of storage, but how much RAM across your 96 nodes? My point is generally that you can scale horizontally to roughly 100 nodes (with no replicas, one index), and each node is roughly limited to about 31GB of RAM devoted to ES. So you're effectively limited to about 3TB of RAM for your index. Storage size seems less meaningful because (a) ES is often not used as a main data store, and (b) ES is a search engine, so the index is what matters, and for performance reasons it should mostly fit into RAM.

I get it. It's just a lot of work to partition. And if you have an exponentially growing database, you will have to partition more and more as time goes on. Partitioning is not scalable.

I'm not sure routing is any more dangerous than splitting up indices. If you partition indices incorrectly and one ends up many times larger than the others, you'll run into similar issues. Perhaps, however, those issues are harder to debug with routing.

That's not my understanding of these larger NoSQL systems (my experience is primarily with Bigtable, not Mongo). From what I understand, they will scale indefinitely. The basic idea is that resharding a single node does not require resharding the entire system, so as data is added and one node gets full, it can be split up without affecting other shards in the database. Another feature is that when you read, you are not guaranteed to get the most recently written data; this is called eventual consistency, and it allows this class of databases to scale far more easily than a typical SQL database.

If your goal is to hold the entire index in RAM, then that's a fundamentally different question than the one you're asking, and you're making all kinds of assumptions that don't really make sense. The RAM limitation you're talking about is a heap limitation; ES makes extensive use of the OS buffer cache to hold much of the data specifically to keep it off the heap, and that's only limited by what the hardware can support. Operationally, to achieve good performance you have to be able to hold the working set of the index in memory for whatever real-world queries actually require. That's going to be a much smaller number than the total size of the index, and it's very specific to a particular workload. Overall sizing of clusters is a really complicated thing that depends on all kinds of parameters related to the specific use case. Trying to conclude that you're limited to a 3TB index because that's all the heap you have available is simply wrong. ES scales far, far further than that, but exactly how far has no answer other than "it depends".

Routing and splitting indices are fundamentally different things. If you get routing wrong you'll get inconsistently sized shards in the same index, and when you start dealing with really large shards that creates balance issues in the cluster. When you end up with an index that has one shard at 20GB and another at 500GB, you'll pretty quickly figure out why that's so bad.
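To make the skew concrete, here's a rough Python sketch. It is not the actual ES routing code (ES hashes the routing value with murmur3 and takes it modulo the primary shard count; the MD5 hash and the customer names below are just stand-ins), but it shows how a low-cardinality routing key piles documents onto one shard while a high-cardinality key spreads them evenly:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 10

def shard_for(routing_value: str) -> int:
    # Stand-in for ES routing: hash of the routing value modulo the primary shard count.
    digest = hashlib.md5(routing_value.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Balanced: route on a high-cardinality key, e.g. the document id.
balanced = Counter(shard_for(f"doc-{i}") for i in range(100_000))

# Skewed: route on a key where one value (one big customer) dominates.
skewed = Counter(
    shard_for("customer-A" if i % 10 else f"customer-{i}")
    for i in range(100_000)
)

print("balanced:", balanced)  # roughly 10,000 docs per shard
print("skewed:  ", skewed)    # ~90% of the docs land on a single shard
```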

Bigtable is one of the few systems that has the ability to split shards (or tablets, as it calls them). Most other systems don't have that ability due to the properties of the shard keys they use, and that is the case with Elasticsearch. That said, for Elasticsearch this is a sizing and design consideration, not a hard limit on scalability. Would it be nice if it could split shards? Absolutely, but it's not as big a scalability limitation as you seem to want to believe it is. HBase, since it's based on the same model as Bigtable, can also split shards (which it calls regions), and we have large HBase environments as well, but it's a fundamentally different data model. When I designed the first Elasticsearch-based system for a customer, I evaluated and tested all the major distributed data stores, and there's nothing that scales as well while still providing such flexible query/indexing facilities as ES. In particular, as far as HBase is concerned, you give up a huge amount of query power to get the ability to auto-split shards, and even with that ability you still can't scale endlessly, because you hit other limitations. It's all just tradeoffs; there's no magic.

Kimbro

Glad that Kimbro finally commented on the OS cache. The heap holds many other caches, but the shard data is held in the OS file cache. The OS file cache is not subject to GC or the theoretical 32GB heap limit. A node with 256GB of memory has around 222GB of filesystem cache for shards, with a 32GB heap.

While it is ideal to keep shards in memory, it is not imperative. I recently helped a client running around 300GB per shard. Their primary reason for the engagement was not scale or performance; they were running pre-1.x ES and needed to upgrade. My first recommendation was to reduce the shard size. It is not required to have the active dataset in RAM. Of course an index not in RAM will be significantly slower, but ES is still fast.

If the active data is not in RAM, the performance of the disk becomes critical. This is one of the reasons, aside from ingestion, that SSD-based clusters are becoming common. This client, like many of us, had time-based data, which is easy to segment, and most queries run against the most recent data, e.g. the last 24 hours or last 7 days. Occasionally someone needs to run a 90-day or 1-year query, and it is just not possible to keep that much data in RAM on most budgets. So using SSDs keeps these queries fast, while not incurring the expense of RAM and additional nodes.

There are so many issues with this single-index premise; many have been covered, but one that does not seem to have been mentioned directly is growth. As was mentioned, ES cannot split shards like Solr can. So when your index with 100 shards needs to grow beyond 100 nodes, you will need to reindex everything, in one shot. Also, if your shards grow beyond the theoretical 32GB ceiling you have placed on them, you will need to increase the number of shards. However, if you are using a partitioning scheme you don't have the same issues: you can reindex partitions independently if needed, and if you are using time-based data you create new indexes per time period, so the data naturally partitions.

It is unreasonable to expect ES to identify the ideal partitioning of your data. ES has done its part in that regard by providing the index-over-shards concept. The shards partition the data arbitrarily and are grouped into a logical dataset. If you understand that an index is nothing more than a logical grouping of shards, and that ES provides other mechanisms to group shards, such as aliases, then it starts to make sense that, from the perspective you are looking at this, there is little difference between a single index with 100 shards and 10 indexes with 10 shards each. Yes, you would have to do a little work upfront to distribute incoming documents, but that does not require an army of librarians as you have asserted. Typically, with some effort, a natural key can be identified, which as a result will also improve query times. If not, you can do something similar to what ES does and hash. If you use a consistent hash, you will be able to minimize the impact of the eventual resizing, as fewer documents will need to move shards (see the sketch below). Explaining this fully is well beyond the scope of this reply, but the point is that partitioning need not be a manual process.
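For illustration, here's a minimal consistent-hashing sketch in Python. The ring construction, partition names, and point counts are made up for the example; the point is only that adding one more partition moves a small fraction of documents, which is what makes the eventual resizing cheap compared to plain modulo hashing:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring; each partition owns many points on the ring."""

    def __init__(self, partitions, points_per_partition=100):
        self.ring = sorted(
            (_hash(f"{p}#{i}"), p)
            for p in partitions
            for i in range(points_per_partition)
        )
        self.points = [h for h, _ in self.ring]

    def partition_for(self, doc_id: str) -> str:
        # First ring point at or after the document's hash, wrapping around.
        idx = bisect.bisect(self.points, _hash(doc_id)) % len(self.ring)
        return self.ring[idx][1]

docs = [f"doc-{i}" for i in range(50_000)]
before = HashRing([f"index-{n}" for n in range(10)])
after = HashRing([f"index-{n}" for n in range(11)])  # grow by one partition

moved = sum(before.partition_for(d) != after.partition_for(d) for d in docs)
print(f"{moved / len(docs):.1%} of documents move")  # roughly 1/11th, versus ~90% with plain modulo
```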

As already said, Mongo does require partitioning determined by the user. Partitioning and finding the right natural key was one of the major topics at all of the MongoDB World events that I attended. Like ES routing, your MongoDB shard key has significant distribution and performance implications. To scale beyond a single MongoDB node, the same exercise is required: a shard key must be identified and the data must be sharded. This must be defined by the user, equivalent to ES routing, with the same caveats. The big difference with ES is that the first stage of this happens automatically.

I understand your replica comment and 2x RAM: keeping 2x the number of shards in memory requires 2x the memory. However, this does improve concurrent query performance. In heavy query environments it is not uncommon to have replicas = number of nodes, so that every node has every shard. The point everyone else was making, I think, is that replicas do not need to be in RAM if the primary is in RAM, so replicas = 1 does not necessarily mean you need 2x RAM.

To summarize: the 100-node limit is incorrect; many people are running larger clusters. It is a networking consideration, and there are network infrastructures that will support the traffic of more than 100 nodes. The 31GB-per-shard limit is incorrect; shard size is only truly limited by your disk size, though it is performance-limited by RAM less heap. People are running clusters with tens of thousands of shards. An index is a logical grouping of shards, so if a cluster can scale to tens of thousands of shards, a single index can also. It is trivial to scale an ES cluster beyond 3TB; just don't do it in a single index. Partitioning the data is trivial and does not require manual work. Searching across multiple indexes is trivial, and typically automatic.


I did not suggest that the goal is to keep the entire index in RAM. My point is about scaling. If you double the size of your index without doubling the amount of RAM in the cluster, performance will degrade. So scaling up while keeping performance constant necessarily requires adding more RAM. This is true regardless of whether 5% or 100% of your index is in RAM. The issue is that if there is a limit to how much RAM can be devoted to the index, scaling eventually hits a wall.

This is a good point. It's unclear to me exactly what is stored in ES's heap and what is in the OS cache. Assuming you had a node with 1TB of RAM, where only 31GB goes to ES's heap, whether ES/Lucene makes good use of the remaining RAM through the OS cache probably depends on the types of queries you're performing. This article suggests it does, but only if you limit the types of aggregations you perform. That said, if this is indeed the case, then it makes a strong case for vertically scaling each node far beyond 31GB (it does not impact horizontal scaling, though).

Yes, I know they are different. My point is that both are a form of logical partitioning, and the hard part is to come up with a partitioning that makes sense for your project.

There's no question that ES has far more capabilities than these other databases, but it's always good to know what the limits of the system are.

I agree that in some use cases (log data in particular), time-based partitioning is simple and makes a lot of sense. But other cases are far more difficult.

Book search is a good example, because time-based partitioning makes little sense (users often run searches across time periods). For books, you would probably have to partition into subject areas, and as each subject area grows too large for a single index, you would have to break that subject down into smaller sub-subjects, and then sub-sub-subjects, and so forth. This is exactly what librarians do. The problem is that if your data is growing exponentially, then you will also need to re-partition your data exponentially.

I use ES for legal documents, which is similar to books in that there is no "trivial" way to partition them.

Agreed. It isn't likely exactly 2X, probably a bit less, although how much less depends on usage.

I never said that there is a 100-node limit. I said that without partitioning (routing or separate indices), and with zero replicas, there is probably a 100-node limit. In my single-index hypothetical, every query has to be sent to 100 nodes, and the results have to be merged together. It seems unlikely that such a system will be performant.

Agreed, but probably not to the 1TB extent, though I understand the hypothetical.

I assumed this was the case, and that there are situations where there is no natural key. In that case there is always the option to hash against an arbitrary key, hence my comments about consistent hashing. However, I suspect that in the case you have described there is the possibility of using time-based partitioning, based on the date a document was published. Part of the issue you will face is the need to continually add new documents as they are published: no matter your selected shard count, at some point you are going to have too much data per shard if you stick with a single index. Since documents are not published in the past, creating new indexes going forward will alleviate that issue. You can allow your index to grow until it reaches the desired size, cut it off, and create a new index. Using an alias that points to the current index removes the need for everything to know where to index, and a separate alias allows searches to span the full dataset (a rough sketch follows below). If the publish date happens to be in the query, then you will get a huge boost to performance, as you eliminate shards which do not have that date.
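A rough sketch of the alias approach, using the ES REST API from Python; the index names, alias names, and localhost URL are just placeholders for the example:

```python
import requests

ES = "http://localhost:9200"  # placeholder; point at your cluster

# Create the next time-based index once the current one reaches the desired size.
requests.put(f"{ES}/legaldocs-2016.10")

# Atomically repoint the "write" alias at the new index, and keep a
# catch-all "search" alias that spans every period.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"remove": {"index": "legaldocs-2016.09", "alias": "legaldocs-write"}},
        {"add":    {"index": "legaldocs-2016.10", "alias": "legaldocs-write"}},
        {"add":    {"index": "legaldocs-2016.10", "alias": "legaldocs-search"}},
    ]
})

# Indexing always targets the write alias; searches go through the search alias.
requests.post(f"{ES}/legaldocs-write/doc",
              json={"title": "Some opinion", "published": "2016-10-03"})
requests.post(f"{ES}/legaldocs-search/_search",
              json={"query": {"match": {"title": "opinion"}}})
```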

I do understand your point, but I reject the assertion that there is any other system that will resolve that for you. It is only an issue if you want to optimize your queries by reducing the amount of data covered, and the scheme of sub-dividing as you have suggested would accomplish that. However, every data storage system requires you to teach it about your data. Typically this is by creating indexes on your tables, or on your documents in Mongo; otherwise all documents must be consulted.

Searches across 100s of nodes and across 10,000s of shards can be performant. They are more performant if you are able to determine ways to divide the data that can be applied to the queries that are run. A search on 100 nodes with 100 shards will be slower than a search on 10 nodes with 10 shards, but it will not be 10x slower; more like 10% slower, primarily because a larger result set is collected and a larger number of messages are exchanged.

This is the default behavior of Elasticsearch: it hashes on an arbitrary document id.

The issue is not indexing, but searching. Searches in this domain regularly span long time periods, so even partitioning the collection into periods of time will still require accessing every partition for every search.

I never asserted that another system was capable of doing what ES does at a larger scale. My point was only that there are other systems that can scale much larger than ES without logical partitioning of data. These systems can rely solely on arbitrary document ids and achieve this scale by forgoing many features that ES offers.

The two bottlenecks are (1) messaging on the network and (2) the process of merging results from 100 nodes. With 10x more messages on the network, the network could potentially become 10x slower. When combining query results, the individual results are likely to be smaller, but there are 10x as many to merge. So certain parts of the query process are going to be much slower. However, it's hard to guess how dominant those parts are unless you've tried it, which was part of my motivation for asking this question in the first place.

This is a query hitting 672 nodes spread across 8 clusters searched through a tribe node, total size over 800TB.

{"took":271,"timed_out":false,"_shards":{"total":4960,"successful":4960,"failed":0}

271ms.

Here's the same query going to one of those 8 clusters, which has 84 nodes.

{"took":37,"timed_out":false,"_shards":{"total":680,"successful":680,"failed":0}

So 37ms for roughly 100TB of data.

Obviously the data set is partitioned, but because of how ES works it's still searchable as a single entity, even across clusters, and the partitioning actually gives you significant power to optimize your access patterns. For example, this is time series data, so a single day in this particular case is about 11TB; here's the same query for just one day.

{"took":5,"timed_out":false,"_shards":{"total":80,"successful":80,"failed":0}

FYI, the largest single cluster we've built was 168 nodes using ES 1.7, which was definitely pushing the limits of that version of ES. With ES 2 I've heard of cases running much larger clusters, but based on what we've learned from running production systems it's better to keep the actual clusters smaller if it can be accommodated.

We have also stored very large data sets lacking natural partitions in multiple clusters like this by using hashed keys, as @aaronmefford mentioned. It's basically manual sharding, but because ES allows you to join everything back together using tribe nodes, it appears as a single entity for search, so it's really not a major issue.

To me, partitioning the data is just part of the process of scaling, regardless of the system used, and once you start trying to deal with all the other issues you'll encounter with so much data, you'll quickly realize why it is your friend, not something to be avoided.

Kimbro


Specifically around your use case @Michael_Sander: law cases have dates attached to them, right? You can also add fields to the documents to help categorise them (and even use the percolator for that part).

But despite the larger document size, it can still be extrapolated out to a time-based use case :slight_smile:

This is incredible performance; it's almost hard to believe. I feel like just pinging 672 nodes would take more than 271ms. Are you sure that all of the nodes are being contacted for this search? If this is indeed the case, then it certainly shows that ES handles horizontal scaling incredibly well.

Interesting, I didn't read much about scaling in the 2.x release notes. I wonder what about ES2 improves scaling.

Yes, but the dates are relatively meaningless as far as partitioning for queries because users often search across time periods.

Yes we can, but this is where it starts getting complex.

Yes, it's hitting all the nodes; I left in the shard counts to show the scaling of the work being done. Welcome to the wonders of async network APIs. The issues in scaling ES all manifest in other areas, and there are plenty of them, but it certainly does scale.

In ES 1.7, search and indexing worked fine, but the cost of a large cluster manifested in the distribution of cluster state changes to all the nodes. This operation is relatively rare, but in 1.x it distributed the entire cluster state every time; in 2.x it distributes just the changes. In 1.7, as the cluster state grew, the distribution got more and more expensive, and rapid state changes (like deleting a bunch of indices) could really disrupt the cluster. In 2.x it's supposed to be much better at running larger clusters, but I've never pushed it, as I find it more manageable to still keep the size down.

Actually, the dates might be very useful, and you answered why: "users often search", the keyword being "often". ES has very powerful capabilities in this area which can be used as an additional tool for optimizing query performance. Just partition your data into some reasonable boundaries: monthly, daily, hourly, whatever works for the particular use case. So you have indices named legalstuff-2016.09.29 or whatever, and then you can restrict your searches to just the range that matters: the specific index for today, legalstuff-2016.09.* for all of September, legalstuff-2016.* for all of 2016, and so on. How you choose the partitions is a design point for your particular application, but it gives you a great tool for optimizing the response time for your users. If you're lucky and 90% of queries only care about the last 90 days or something, then you can design to make that case optimal while still enabling deeper searches when needed.
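As a rough illustration of what that looks like from a client, assuming the date-suffixed index names above (the URL, field name, and query are placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder; point at your cluster

query = {"query": {"match": {"text": "statute of limitations"}}}

# Today only: hit a single day's index.
requests.post(f"{ES}/legalstuff-2016.09.29/_search", json=query)

# All of September 2016: the wildcard expands to ~30 daily indices.
requests.post(f"{ES}/legalstuff-2016.09.*/_search", json=query)

# All of 2016: still one request; ES fans it out to every matching index's shards.
requests.post(f"{ES}/legalstuff-2016.*/_search", json=query)
```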

And if date truly isn't the right partition for the legal use case, then look at how the users do their work and see if there is another boundary that could be leveraged to provide the optimization. The key thing to understand is that to ES the parameter that really matters is the number of shards; whether those shards are in one index, many indices, or even many clusters is just a detail, and this gives you power that you can leverage if you choose to. Worst case, if there's absolutely no logical partition that is useful, then you can certainly run one giant index and it will work; it's just harder to manage. So what's best depends entirely on the specific use case.

Kimbro


@kstaken Can you tell us what type of networking is between the servers (1Gb/s, 10, 40), whether you put them in one rack or close together, and how many search queries/second you handle?

Two isolated 10G networks, one for transport and one for management/HTTP, spanning many racks with 40G interconnections.

@kstaken
Hi Kimbro
Can you please share how you arrived at the number of nodes, shards, and the hardware required?
We have a similar use case and are looking at indexing event data (~1TB) per day, where queries can span the last 30 to 90 days. It is very hard to find sizing information, guidelines, and how people calculate the number of nodes, shards (and shard size), and required hardware for large ES clusters (~100TB of data).
Thanks