I'm using a combination of composite, terms and tophits aggregation. I'd like to sort the results (groups retrieved) based on a document's property (nested field property).
I read about bucket sort, but really not sure if that's the ideal solution for me.
This analysis can get tricky if your data is spread across multiple indices or shards and your grouping terms have many possible values.
The sequence of questions I would normally ask to get to a solution are in this wizard.
Any way to simplify this? I want the documents to be grouped by a key. Under each bucket I want to show the top most hit. And I need to show the 10 buckets per request sorted by a document's property.
This runs on only one index and I currently have only one shard.
That is easier to process if you have one index and one shard.
This is the bit I'm confused by - presumably you're sorting by a single metric (e.g. maxmin or sum of values in the grouped docs). Otherwise you have multiple docs with different values and it's hard to determine what value to sort groups on (first, last, lowest, highest?).
Without describing agg choices can you say what business problem you're trying to solve?
Okay, so here is the crux:
I have many Emails. And each email has multiple paragrahs in it. All the paragrahs will be dumped into my index.
When the user searches for any keyword (within the paragraph), I want the results to be grouped by the Email subject and under each group I want the top matching paragraph. I have achieved till this.
What I want now, is the top result under each bucket to be sorted based on a paragraph's property.
Still more questions than answers here I'm afraid
So you currently have:
- Email subject
- Top matching paragraph
Are these email subjects to be sorted by a relevance score (that would seem to be most useful).
Have you dealt with the duplication problems of different emails with the same subject line?
If by "bucket" you mean "email subject" then which of the many related paragraphs would we choose to sort on? That's why I was suggesting you might need a min/max etc. What is this property you want to sort on? Is it a date? Wouldn't all paragraphs in an email have the same date? Are the "email subjects" actually threads with many emails?
Thanks for all the prompt responses @Mark_Harwood. Appreciate it.
Yea, the emails will be sorted based on the time received. And I haven't seen any duplication till now.
The sorting at the Email subject level needs to be based on the top most paragraph fetched (which again is sorted based on a couple of paragraph properties - Activity Recorded and Time).
Each Email subject will have a unique identifier based on which the grouping is performed.
I saw this. Can I get the overall distinct Email subject count with this (I currently use cardinality for getting the overall count)? Also can I get the number of matching paragraphs under each Email?
Hi @Mark_Harwood, I tried using Field Collapse for my use case. It works per my requirement.
But got a different problem now. The query now takes on an average around 10s for execution.
This is for fetching the top 10 results of close to 3M matches (with 69K unique buckets).
Is there any way I can optimise this?
FYI, I'm not using any inner hits under field collapse.
I'm not sure I can see a single smoking gun there but the way to find it would be by repeatedly simplifying the query until the response times change dramatically
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.