I am curious whether any ES users have scaled their systems to reliably handle several terabytes of data or beyond. My back-of-the-napkin calculations below suggest that an ES cluster is effectively limited to about 3TB of RAM, which makes scaling much beyond that amount of data/index difficult or impossible.
Edit: To clarify, I mean that scaling a single index, without routing, beyond 3TB of RAM may not be possible. This puts a practical limit on maximum index size.
- Shards: Adding more shards lets you spread the work across more nodes. However, too many shards increase network usage and latency. Every application is different, but the documentation makes clear that 1000 shards is far too many. I have not tested it, but I get the impression that serious performance issues start around 50+ shards.
- Memory: You can increase memory on each node up to 31GB, beyond which the JVM loses compressed object pointers and you pay a RAM penalty for full 64-bit pointers. However, there is another price for breaking the 31GB barrier: longer garbage collections. On my 31GB machines, a full old-generation collection of a ~30GB heap easily takes 5 to 30s. I imagine it could take minutes to collect 100GB, which is not acceptable. Therefore, you are effectively limited to roughly 31GB per node anyway.
- Routing: Setting up routing improves performance, but also limits which searches can be run quickly. My understanding is that queries spanning many routing values see no improvement, since they still fan out to many shards. But more to the point, requiring users to logically divide their data is the opposite of scalability.
- Replicas: Adding replicas increases the total RAM in the cluster and may improve read performance (though it could hurt due to increased network usage), but replicas hold copies of the same data, so they do not add capacity and have no substantial impact on scalability.
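To make the shard/replica/routing knobs above concrete, here is a minimal sketch in Python that just builds the JSON bodies you would send to ES's REST API. The index name `logs`, the user field, and the counts are made-up examples for illustration, not recommendations:

```python
import json

# Hypothetical settings body for index creation (PUT /logs).
# number_of_shards is fixed at index creation time;
# number_of_replicas can be changed later.
settings = {
    "settings": {
        "number_of_shards": 20,   # spreads work across up to 20 nodes
        "number_of_replicas": 1,  # copies of the same data: more RAM, not more capacity
    }
}
print(json.dumps(settings, indent=2))

# A routed search (GET /logs/_search?routing=user123) hits only the shard
# that owns "user123". A query spanning many routing values still fans out
# to many shards, so routing only speeds up single-route queries.
query = {"query": {"term": {"user": "user123"}}}
print(json.dumps(query))
```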
Putting this together, assuming each 31GB node holds one shard and 100 shards is the practical limit, an ES system is capped at about 3TB of RAM (= 31GB * 100 shards / 1024). A system with 3TB of RAM may be able to handle 3TB of data depending on the index structure. However, it would probably need to avoid advanced ES features, and scaling significantly beyond 3TB is hard to imagine.
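Spelling out the same back-of-the-napkin arithmetic (the constants are the assumptions stated above, not measured limits):

```python
# Assumptions from the post: ~31 GB usable heap per node,
# one shard per node, and ~100 shards as a practical ceiling.
heap_gb_per_node = 31
max_shards = 100

total_heap_tb = heap_gb_per_node * max_shards / 1024  # GB -> TB
print(round(total_heap_tb, 2))  # ~3.03 TB of heap across the whole cluster
```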
If I am missing something here, I would really like to hear about it. I would also like to hear of experiences approaching several TB of data. Do you ditch the advanced ES functionality? Do you live with longer-than-snappy query times? Is logically dividing data through routing or separate indices mandatory? Or are some of my assumptions above incorrect?