By default users should only receive the most recent version of a document.
In Solr I used Result Grouping: group.field=documentId, group.limit=1, group.sort=versionId desc.
How can I achieve something similar with Elasticsearch?
I tried the top_hits aggregation (a terms bucket per documentId, ordered by versionId desc, with size 1). That worked, but I ran into problems with paging because, for example, size affects the number of documents, not the number of buckets.
Documentation says: "[...] size - The maximum number of top matching hits to return per bucket.[...]"
Your parent/child hint is interesting. I did not know about that yet.
But what would a grouping query look like if I model my data that way?
Let's say a query matches multiple versions (children) of multiple documents (parents).
How would I make sure only the most recent versions are returned?
Setting size on the terms aggregation should let you control how many unique documentId values you get back.
The size parameter on the top_hits sub-aggregation, by contrast, says how many top documents to keep in each bucket.
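To make the two size settings concrete, here is a minimal sketch of the request body, written as a Python dict so it can be printed as JSON. The field names documentId and versionId come from this thread; the aggregation names (by_document, latest), the helper name and the sample match query on a title field are just placeholders I made up. Note that terms buckets are ordered by document count by default, not by relevance.

```python
import json


def latest_version_body(query, max_documents):
    """Build a search body that keeps only the newest version per documentId."""
    return {
        "size": 0,  # skip the normal hit list; results come from the buckets
        "query": query,
        "aggs": {
            "by_document": {
                # this size limits the number of buckets (unique documentId values)
                "terms": {"field": "documentId", "size": max_documents},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            # this size limits the hits kept inside each bucket
                            "size": 1,
                            "sort": [{"versionId": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


# Hypothetical query; replace with whatever you actually search for.
body = latest_version_body({"match": {"title": "report"}}, max_documents=10)
print(json.dumps(body, indent=2))
```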
Using parent/child, I think you could sort on versionId? But that probably won't work for you in general, i.e. you would typically want to sort on relevance or something else that's child-specific? Do you really need to keep the old child versions around?
You are right. By setting size for the terms aggregation I can control how many documents (unique documentId) I get back. But that way I lose the number of TotalHits.
I somehow need to "rewrite the aggregation" to a standard result.
TotalHits should represent the number of documents (unique documentId) - not versions
Size should limit the number of documents returned per page - not versions
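One possible way to get both, sketched under some assumptions: a cardinality aggregation on documentId can stand in for TotalHits (its count is approximate), and since the terms aggregation has no offset, a page can be cut out client-side by requesting enough buckets to cover it. The names paged_body, to_standard_result, unique_documents, by_document and latest are hypothetical, and the slicing assumes the bucket order is stable between requests.

```python
def paged_body(query, page, page_size):
    """Request enough buckets to cover pages 0..page, plus a unique-document count."""
    return {
        "size": 0,
        "query": query,
        "aggs": {
            # approximate number of distinct documentId values matching the query
            "unique_documents": {"cardinality": {"field": "documentId"}},
            "by_document": {
                "terms": {"field": "documentId", "size": (page + 1) * page_size},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"versionId": {"order": "desc"}}],
                        }
                    }
                },
            },
        },
    }


def to_standard_result(response, page, page_size):
    """Flatten the aggregation response into a total plus one hit per document."""
    aggs = response["aggregations"]
    buckets = aggs["by_document"]["buckets"]
    start = page * page_size
    page_buckets = buckets[start:start + page_size]
    # each bucket holds exactly one hit: the version with the highest versionId
    hits = [bucket["latest"]["hits"]["hits"][0] for bucket in page_buckets]
    return {"total": aggs["unique_documents"]["value"], "hits": hits}


# Hand-written response in the shape Elasticsearch returns for the body above.
sample_response = {
    "aggregations": {
        "unique_documents": {"value": 3},
        "by_document": {
            "buckets": [
                {"key": "a", "doc_count": 3,
                 "latest": {"hits": {"hits": [{"_id": "a-3", "_source": {"documentId": "a", "versionId": 3}}]}}},
                {"key": "b", "doc_count": 2,
                 "latest": {"hits": {"hits": [{"_id": "b-2", "_source": {"documentId": "b", "versionId": 2}}]}}},
                {"key": "c", "doc_count": 1,
                 "latest": {"hits": {"hits": [{"_id": "c-1", "_source": {"documentId": "c", "versionId": 1}}]}}},
            ]
        },
    }
}

second_page = to_standard_result(sample_response, page=1, page_size=2)
print(second_page)
```

Deep pages get expensive this way because every request fetches all buckets up to the requested page, so it is probably only reasonable for the first few pages.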
Or is it possible to filter results to only return one version per document? Something like a SQL subselect.
Argh... I am really stuck here. I don't want to switch back to Solr again.