Group and filter results

I have an index of DocumentVersions - different versions of a document.

{versionId: 1, documentId:1, ...},
{versionId: 2, documentId:1, ...},
{versionId: 3, documentId:2, ...},
{versionId: 4, documentId:2, ...},
...

By default users should only receive the most recent version of a document.

In Solr I used result grouping: group.field=documentId, group.limit=1, group.sort=versionId desc.

How can I achieve something similar with Elasticsearch?

I tried a top_hits aggregation (one bucket per documentId, ordered by versionId desc, with size 1). That worked, but I ran into problems with paging: for example, size affects the number of documents, not the number of buckets.

Maybe using terms aggregation (on your documentId field), with top_hits sub-aggregation?
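
For reference, a minimal sketch of the request body this suggestion implies, written as a Python dict so it can be serialized to JSON. The helper name `latest_version_body`, the aggregation names and the `match_all` placeholder query are assumptions for illustration; the field names come from the sample documents above.

```python
import json


def latest_version_body(query, max_documents=10):
    """Build a search body that keeps only the newest version per documentId."""
    return {
        "size": 0,  # skip the plain hit list; only the aggregation is read
        "query": query,
        "aggs": {
            "by_document": {
                # one bucket per unique documentId
                "terms": {"field": "documentId", "size": max_documents},
                "aggs": {
                    "latest_version": {
                        # keep a single hit per bucket: the highest versionId
                        "top_hits": {
                            "size": 1,
                            "sort": [{"versionId": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = latest_version_body({"match_all": {}})
print(json.dumps(body, indent=2))
```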

Thank you for the tip, but as I wrote, that's what I already tried.
=> I ran into problems with paging.

Whoops, sorry, I missed that you had already tried this.

But: size should control the number of buckets ... not sure why you're seeing otherwise.

You could maybe try parent/child documents instead? Index a parent document for each unique documentId, and then one child document for each versionId. https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html

Documentation says: "[...] size - The maximum number of top matching hits to return per bucket.[...]"

Your parent/child hint is interesting. I did not know about that yet.
But what would a grouping query look like if I modeled my data that way?

Let's say a query matches multiple versions (children) of multiple documents (parents).
How would I make sure only the most recent versions are returned?

Oh, I see: I was talking about the size parameter for the terms aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

That should let you control how many unique documentId values you get back?

Whereas the size parameter to the top_hits sub aggregation says how many top documents to keep in each bucket.
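
To make the difference between the two size parameters concrete, here is a small in-memory sketch that mimics the grouping on the sample data from the question. It is not Elasticsearch code: `group_versions`, `bucket_size` and `hits_per_bucket` are hypothetical names, and ordering buckets by documentId is a simplification.

```python
from collections import defaultdict

# Sample data from the question (other fields omitted).
VERSIONS = [
    {"versionId": 1, "documentId": 1},
    {"versionId": 2, "documentId": 1},
    {"versionId": 3, "documentId": 2},
    {"versionId": 4, "documentId": 2},
]


def group_versions(versions, bucket_size, hits_per_bucket):
    """Mimic terms (bucket_size) plus top_hits (hits_per_bucket) in memory."""
    buckets = defaultdict(list)
    for version in versions:
        buckets[version["documentId"]].append(version)

    result = {}
    # bucket_size plays the role of the terms "size": how many documentIds
    for document_id in sorted(buckets)[:bucket_size]:
        newest_first = sorted(
            buckets[document_id], key=lambda v: v["versionId"], reverse=True
        )
        # hits_per_bucket plays the role of the top_hits "size"
        result[document_id] = newest_first[:hits_per_bucket]
    return result


print(group_versions(VERSIONS, bucket_size=2, hits_per_bucket=1))
```

With `bucket_size=2` and `hits_per_bucket=1`, each of the two documents contributes only its newest version (versionId 2 and versionId 4).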

Using parent/child, I think you could sort on versionId? But, that probably won't work for you in general, i.e. you typically would want to sort on relevance or something else that's child-specific? Do you really need to keep the old child versions around?

You are right. By setting size for the terms aggregation I can control how many documents (unique documentId) I get back. But that way I lose the number of TotalHits.

I somehow need to "rewrite the aggregation" to a standard result.

  • TotalHits should represent the number of documents (unique documentId) - not versions
  • Size should limit the number of documents returned per page - not versions
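
One way to approximate both requirements with aggregations is sketched below. A `cardinality` aggregation on documentId stands in for TotalHits (it returns an approximate count), and because the terms aggregation has no offset parameter, the sketch requests `offset + page_size` buckets and slices the page out on the client. The helper names, the aggregation names and the response passed to `extract_page` are assumptions for illustration; this has not been run against a cluster.

```python
def paged_body(query, offset, page_size):
    """Request enough buckets to cover the page, plus a unique-document count."""
    return {
        "size": 0,
        "query": query,
        "aggs": {
            # approximate number of unique documentId values matching the query
            "unique_documents": {"cardinality": {"field": "documentId"}},
            "by_document": {
                # no offset on terms, so fetch every bucket up to the page end
                "terms": {"field": "documentId", "size": offset + page_size},
                "aggs": {
                    "latest_version": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"versionId": {"order": "desc"}}],
                        }
                    }
                },
            },
        },
    }


def extract_page(response, offset, page_size):
    """Turn the aggregation response into (total, documents for this page)."""
    aggs = response["aggregations"]
    total = aggs["unique_documents"]["value"]
    page_buckets = aggs["by_document"]["buckets"][offset:offset + page_size]
    documents = [
        bucket["latest_version"]["hits"]["hits"][0]["_source"]
        for bucket in page_buckets
    ]
    return total, documents
```

The cost grows with the page number, since every earlier bucket has to be fetched and thrown away, so this is only practical for shallow paging.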

Or is it possible to filter results so that only one version per document is returned? Something like a SQL subselect.

Argh... I am really stuck here. I don't want to switch back to Solr again.