Group and filter results

PhilippC · July 18, 2016, 3:09pm

I have an index of DocumentVersions - different versions of a document.

{versionId: 1, documentId:1, ...},
{versionId: 2, documentId:1, ...},
{versionId: 3, documentId:2, ...},
{versionId: 4, documentId:2, ...},
...

By default users should only receive the most recent version of a document.

In Solr I used ResultGrouping: group.field=documentId, group.limit=1, group.sort=versionId desc.

How can I achieve something similar with Elasticsearch?

I tried the top_hits aggregation (a bucket per documentId, ordered by versionId desc, with size 1). That worked, but I ran into problems with paging: for example, size affects the number of documents, not the number of buckets.
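
To make the intended result concrete, here is a minimal in-memory sketch of what those grouping parameters do (group by documentId, sort each group by versionId descending, keep one hit per group). It is plain Python over the sample data above, not Solr or Elasticsearch code, and the function name `latest_versions` is made up for illustration.

```python
from operator import itemgetter


def latest_versions(versions):
    """Return the entry with the highest versionId for each documentId."""
    latest = {}
    for version in versions:
        doc_id = version["documentId"]
        # Keep this entry if it is the first or the newest seen for doc_id.
        if doc_id not in latest or version["versionId"] > latest[doc_id]["versionId"]:
            latest[doc_id] = version
    # Sort by documentId only to make the output order deterministic.
    return sorted(latest.values(), key=itemgetter("documentId"))


versions = [
    {"versionId": 1, "documentId": 1},
    {"versionId": 2, "documentId": 1},
    {"versionId": 3, "documentId": 2},
    {"versionId": 4, "documentId": 2},
]

for hit in latest_versions(versions):
    print(hit)
```

With this sample, only versionId 2 (documentId 1) and versionId 4 (documentId 2) remain.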

mikemccand · July 18, 2016, 3:44pm

Maybe using terms aggregation (on your documentId field), with top_hits sub-aggregation?

PhilippC · July 18, 2016, 5:12pm

Thank you for the tip, but that's what I wrote I had already tried.
=> ran into problems with paging.

mikemccand · July 18, 2016, 6:04pm

Whoops, sorry, I missed that you had already tried this.

But: size should control the number of buckets ... not sure why you're seeing otherwise.

You could maybe try parent/child documents instead? Index a parent document for each unique documentId, and then one child document for each versionId. https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html

PhilippC · July 18, 2016, 7:09pm

Documentation says: "[...] size - The maximum number of top matching hits to return per bucket.[...]"

Your parent/child hint is interesting. I did not know about that yet.
But what would a grouping query look like if I modeled my data that way?

Let's say a query matches multiple versions (children) of multiple documents (parents).
How would I make sure only the most recent versions are returned?

mikemccand · July 18, 2016, 8:02pm

Oh, I see: I was talking about the size parameter for the terms aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

That should let you control how many unique documentId you get back?

Whereas the size parameter to the top_hits sub aggregation says how many top documents to keep in each bucket.
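
As a rough sketch of where the two size parameters sit, the request body could be built like this. The field names documentId and versionId come from this thread; the function name and the aggregation names `by_document` and `latest_version` are placeholders chosen for illustration, and the body is only assembled as a Python dict here, not sent to a cluster.

```python
def build_grouped_query(query, max_documents):
    """Build a search body with one bucket per documentId and its newest version."""
    return {
        # Suppress the normal hit list; results are read from the aggregation.
        "size": 0,
        "query": query,
        "aggs": {
            "by_document": {
                # This size limits the number of buckets (unique documentId values).
                "terms": {"field": "documentId", "size": max_documents},
                "aggs": {
                    "latest_version": {
                        "top_hits": {
                            # This size limits the hits kept inside each bucket.
                            "size": 1,
                            "sort": [{"versionId": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = build_grouped_query({"match_all": {}}, max_documents=10)
print(body["aggs"]["by_document"]["terms"])
```

Note that terms buckets are ordered by document count by default, so the bucket order is not a relevance ranking unless an explicit order is configured.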

Using parent/child, I think you could sort on versionId? But, that probably won't work for you in general, i.e. you typically would want to sort on relevance or something else that's child-specific? Do you really need to keep the old child versions around?

PhilippC · July 19, 2016, 9:46am

You are right. By setting size for the terms aggregation I can control how many documents (unique documentId) I get back. But that way I lose the number of TotalHits.

I somehow need to "rewrite the aggregation" to a standard result.

TotalHits should represent the number of documents (unique documentId) - not versions
Size should limit the number of documents returned per page - not versions

Or is it possible to filter results to only return one version per document? Something like a SQL subselect.

Argh... I am really stuck here. I don't want to switch back to Solr again.
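
One possible way to recover a TotalHits-like number for unique documents, offered as a sketch rather than a confirmed answer from this thread, is a cardinality aggregation on documentId. Its result is an approximate count, so it may deviate slightly for large numbers of distinct values. The helper names and the aggregation name `unique_documents` below are hypothetical; the fake response only mimics the shape of the aggregations section of a search response.

```python
def unique_document_count_agg(field="documentId"):
    """Build an aggregation that counts distinct values of the given field."""
    return {"unique_documents": {"cardinality": {"field": field}}}


def read_unique_document_count(response):
    """Read the (approximate) distinct count from a search response dict."""
    return response["aggregations"]["unique_documents"]["value"]


# Search body that returns no hits, only the distinct-document count.
body = {
    "size": 0,
    "query": {"match_all": {}},
    "aggs": unique_document_count_agg(),
}

# Hand-written stand-in for a real response, for illustration only.
fake_response = {"aggregations": {"unique_documents": {"value": 2}}}
print(read_unique_document_count(fake_response))
```

This aggregation can be placed next to the terms/top_hits aggregation in the same request, so a single search returns both a page of documents and the number of unique documents.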
