I am a total newbie so forgive me if this is answered in the latest documentation and I just haven't seen it.
Is there native (configurable) support for search result grouping by N user-selected fields? I have an index with 100 million docs and roughly 200 fields. I'd like to run a search (even a "get all" search) with a field set I choose at runtime (say up to 20 fields) and get a unique result set "grouped" by that field set. I expect high cardinality. I also require user-defined sorts and some basic stats with each unique result (count and sum(field)).
I've read that the latest Solr can support this but only up to a max of N=4.
Sure. So the index has 200 fields and 100M docs. The user wants to browse the data to look for patterns based on some key fields they expect to be relevant. So they run a search like q=* and choose a field list of A, B, C, D, E, F, G, H, I, and J, and a sort of B ascending. The user expects the results to be grouped by those 10 fields, meaning the docs are rolled up where there are exact value matches for the 10 fields. The user can see a count of how many docs are in each group, and the sum of a number field X.
Count, Sum, A, B, C, ... (headers)
10, 45.678, Jim, 10 Main Street, 1/1/2016, ... (there are 10 docs in the index that have the same values for the 10 fields requested)
35, 34545.56, Jane, 11 Main Street, 2/3/2015, ...
This is possible in ES using the scripting support of the Terms Aggregation. E.g. the user selects the fields they are interested in, that list is passed to the terms agg as a script, and the script extracts each of the field values for aggregating.
The downside to this approach is that it'll be slower, since the script won't be able to use certain internal optimizations. But it'll do the job.
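As a rough sketch, here's how such a request body could be built. The field names (A through J, and the numeric field X) come from the scenario above; the `doc['field'].value` script syntax is an assumption and the exact script language (Groovy vs. Painless) depends on your Elasticsearch version.

```python
# Sketch: build a terms-aggregation request whose bucket key is a script
# that concatenates the user-selected fields. Field names are illustrative
# and the script syntax may need adjusting for your ES version.
selected_fields = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

# Join each doc's values with a separator unlikely to appear in the data.
script_source = " + '|' + ".join(f"doc['{f}'].value" for f in selected_fields)

request = {
    "size": 0,  # we only want the aggregation buckets, not raw hits
    "aggs": {
        "groups": {
            "terms": {
                "script": {"source": script_source},
                "size": 100,
            },
            "aggs": {
                # per-group sum of the numeric field X from the scenario
                "sum_x": {"sum": {"field": "X"}}
            },
        }
    },
}
```

Each bucket's key is then the pipe-joined combination of the ten field values, with `doc_count` giving the count and `sum_x` the sum per group.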
If you know ahead of time which fields are selectable, and in what combinations, you can use the copy_to functionality to create a special "combined" field to aggregate on.
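A minimal sketch of that mapping-based approach, with illustrative index and field names:

```python
# Sketch: funnel two known fields into one combined field via copy_to,
# so a plain (non-scripted) terms aggregation can run against it.
# Field names are illustrative.
mapping = {
    "mappings": {
        "properties": {
            "A": {"type": "keyword", "copy_to": "A_B"},
            "B": {"type": "keyword", "copy_to": "A_B"},
            "A_B": {"type": "keyword"},
        }
    }
}

# The aggregation then targets the combined field directly, which lets
# Elasticsearch use its normal per-field optimizations instead of
# evaluating a script per document.
agg_request = {
    "size": 0,
    "aggs": {"groups": {"terms": {"field": "A_B", "size": 100}}},
}
```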
Thanks Zachary. I see the section "Multi-field terms aggregation" in the docs now. I have a couple of follow-up questions.
Do field terms hold on to deleted values? Say I have a field called Animal and its indexed terms are "cat", "dog", and "bird". If I delete all the docs that contain Animal=bird, then soon after (seconds? minutes?) ask for all the terms on Animal, will I still see "bird"?
Most fields in my indexes are high-cardinality (can be in the 7 figures). I read that setting "size" to 0 is bad - "Don't use this on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network." From your experience would it be reasonable to set the "size" to the shard doc count, say 10,000,000, while also using a script to group by 10 or more fields? Of course load testing is the only way to answer performance questions, but does this scenario sound performant to you (for real-time user searches)?
Potentially, for a brief moment. Elasticsearch offers "near real-time" search, meaning that search results lag behind real time by a certain amount. Under the default settings this lag is one second, as defined by the refresh_interval setting. So if you delete all the documents containing "bird" (and can do that in under one second) and then query before the 1s threshold, you'll still see some "bird" docs returned.
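For reference, a sketch of the two knobs involved here (the index name and endpoints in the comments are illustrative):

```python
# Sketch: settings body to change an index's refresh interval; the
# default is "1s". Sent as, e.g., PUT /my_index/_settings.
refresh_settings = {"index": {"refresh_interval": "1s"}}

# A client that must observe its own deletes immediately can instead
# force a refresh before searching (POST /my_index/_refresh), at some
# cost to indexing throughput if done frequently.
```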
From your experience would it be reasonable to set the "size" to the shard doc count, say 10,000,000 while also using script to group by 10 or more fields?
This would give me pause as well. What you're telling ES is to calculate the top 10m buckets even though you just want the top 10. A larger shard_size exists to give better accuracy for your top-n results by collecting a larger candidate pool on each shard before pruning locally.
But under most data distributions (e.g. a Zipf or power-law distribution, where the top results are highly prevalent and frequency quickly falls off into a long tail of rarely occurring terms), you don't need many more results to guarantee the top n are accurate, because of that quick falloff. E.g. a shard_size of 50 or 100 is sufficient to guarantee the top 10 are accurate under a typical power law.
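That intuition is easy to check with a toy simulation (all the numbers here are made up for illustration): give terms Zipf-like counts, split each term's occurrences evenly across five shards, have each shard report only its local top shard_size buckets, and merge.

```python
from collections import Counter

NUM_SHARDS = 5
NUM_TERMS = 1000

# Toy Zipf-like distribution: the term at rank r occurs 12000 // r times,
# so the top terms dominate and counts fall off into a long tail.
global_counts = {f"term_{r}": 12000 // r for r in range(1, NUM_TERMS + 1)}

# Deal each term's occurrences round-robin across shards, roughly what
# hash-based document routing achieves.
shards = [Counter() for _ in range(NUM_SHARDS)]
i = 0
for term, count in global_counts.items():
    for _ in range(count):
        shards[i % NUM_SHARDS][term] += 1
        i += 1

def merged_top(n, shard_size):
    # Each shard returns only its local top `shard_size` buckets; the
    # coordinating node merges them and keeps the overall top `n`.
    merged = Counter()
    for shard in shards:
        for term, cnt in shard.most_common(shard_size):
            merged[term] += cnt
    return [term for term, _ in merged.most_common(n)]

true_top = sorted(global_counts, key=global_counts.get, reverse=True)[:10]

# A shard_size of 50 already recovers the exact top 10 here.
print(merged_top(10, shard_size=50) == true_top)  # True
```

Under this kind of falloff the global top 10 terms are so frequent that every shard's local top 50 contains all of them, so the merged counts for those terms are exact. Nothing like a 10M-deep queue is needed.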
The problem with a large shard_size is the memory overhead it adds: you have to maintain a priority queue 10m entries deep. And this is done per shard, so if you have multiple shards on one node, you'll have higher overhead than you might expect. And it's a lot of wasted work, really.