Hello,
We’re looking to use ES for reporting and analytics in addition to search; however, I’ve been hitting limits when running deeply nested aggregations over large datasets, and I’ve tried various approaches to get around them. Unfortunately, we’re limited to a single node at the moment, and I understand there’s a JVM heap limit of ~32GB (beyond which compressed object pointers are lost).
I’m on the latest version, 6.1.2, using NEST/Elasticsearch.Net on Windows Server 2012 with 32GB of RAM and ~400GB of disk space.
The data are financial transactions: essentially fact tables with a potentially large number of dimensions and metrics, indexed into Elasticsearch. Our requirement is to generate dynamic reports that let a user select up to ~20 or more “group by” terms. Each fact table is in its own index with 3-4 shards; each shard is roughly 650MB-1GB and holds anywhere from ~600k up to ~4 million docs.
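For concreteness, here is a simplified sketch of what one of our fact-table mappings looks like (field and index names are illustrative, not our real schema), with keyword dimensions and numeric metrics:

```
PUT /facts-sales
{
  "mappings": {
    "doc": {
      "properties": {
        "txn_date":   { "type": "date" },
        "account_id": { "type": "keyword" },
        "region":     { "type": "keyword" },
        "product":    { "type": "keyword" },
        "amount":     { "type": "double" },
        "quantity":   { "type": "integer" }
      }
    }
  }
}
```

The keyword fields are the candidate “group by” terms; the numeric fields get summed/averaged in the report.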
I’ve gone through various articles trying to figure out how to optimize, such as using doc values and tuning various config settings, and I’ve read that deeply nested aggregations are strongly discouraged because of the combinatorial explosion of buckets.
As suggested here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
I’m currently trying terms aggregation partitioning to see if I can “page” through the large datasets. Even if I’m able to aggregate one partition at a time, am I correct in understanding that the results would not be sorted across partitions?
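This is the shape of the partitioned request I’m experimenting with (again, field names are placeholders): the same query is re-issued with `partition` set to 0, 1, 2, … up to `num_partitions - 1`, each call covering a disjoint slice of the terms:

```
POST /facts-sales/_search
{
  "size": 0,
  "aggs": {
    "by_account": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      },
      "aggs": {
        "total_amount": { "sum": { "field": "amount" } }
      }
    }
  }
}
```

My understanding is that each response is ordered only within its own partition, so a global sort would have to be merged client-side.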
I’d like to better understand how those with terabytes of data manage (if indeed they do) to run large, deeply nested aggregation queries on limited memory with ES. Are other technologies used alongside Elasticsearch in some intermediary phase? Should I restructure the data? Or is it simply a matter of growing and scaling the cluster? And in that case, how does it manage to aggregate terabytes of data across nodes?
Any pointers to docs or explanations of the inner workings, and ultimately solutions for handling large aggregation queries whose working set goes well beyond the JVM heap limit, would be greatly appreciated.
(The alternative is to fall back on cubes in MS SSAS, but we’re hoping to leverage ES since we’re already using it for search.)
Thanks very much.