Sorting results from composite aggregation

Vignesh_Theyagarajan · June 29, 2020, 8:44am

I'm using a combination of composite, terms and tophits aggregation. I'd like to sort the results (groups retrieved) based on a document's property (nested field property).
I read about bucket sort, but really not sure if that's the ideal solution for me.

Here is my aggregators query part:

        "aggs": {
            "my_comp_agg": {
              "aggs": {
                "my_terms_agg": {
                  "aggs": {
                    "my_tophits_agg": {
                      "top_hits": {
                        "size": 1,
                        "sort": [
                          {
                            "TimeTicks": {
                              "order": "desc"
                            }
                          }
                        ]
                      }
                    }
                  },
                  "terms": {
                    "field": "my_property",
                    "size": 10
                  }
                }
              },
              "composite": {
                "after": {
                  "my_comp_terms_agg": ""
                },
                "size": 10,
                "sources": [
                  {
                    "my_comp_terms_agg": {
                      "terms": {
                        "field": "my_property"
                      }
                    }
                  }
                ]
              }
            },
            "my_count_agg": {
              "cardinality": {
                "field": "my_property"
              }
            }
          }

Can anyone help me on how do I go about it?

Mark_Harwood · June 29, 2020, 9:21am

This analysis can get tricky if your data is spread across multiple indices or shards and your grouping terms have many possible values.
The sequence of questions I would normally ask to get to a solution are in this wizard.

Vignesh_Theyagarajan · June 29, 2020, 9:36am

Any way to simplify this? I want the documents to be grouped by a key. Under each bucket I want to show the top most hit. And I need to show the 10 buckets per request sorted by a document's property.
This runs on only one index and I currently have only one shard.

Mark_Harwood · June 29, 2020, 9:47am

That is easier to process if you have one index and one shard.

This is the bit I'm confused by - presumably you're sorting by a single metric (e.g. max min or sum of values in the grouped docs). Otherwise you have multiple docs with different values and it's hard to determine what value to sort groups on (first, last, lowest, highest?).

Without describing agg choices can you say what business problem you're trying to solve?

Vignesh_Theyagarajan · June 29, 2020, 10:10am

Okay, so here is the crux:
I have many Emails. And each email has multiple paragrahs in it. All the paragrahs will be dumped into my index.
When the user searches for any keyword (within the paragraph), I want the results to be grouped by the Email subject and under each group I want the top matching paragraph. I have achieved till this.
What I want now, is the top result under each bucket to be sorted based on a paragraph's property.

Mark_Harwood · June 29, 2020, 10:27am

Still more questions than answers here I'm afraid
So you currently have:

- Email subject
    - Top matching paragraph

Are these email subjects to be sorted by a relevance score (that would seem to be most useful).
Have you dealt with the duplication problems of different emails with the same subject line?

If by "bucket" you mean "email subject" then which of the many related paragraphs would we choose to sort on? That's why I was suggesting you might need a min/max etc. What is this property you want to sort on? Is it a date? Wouldn't all paragraphs in an email have the same date? Are the "email subjects" actually threads with many emails?

Have you looked at the field collapsing feature?

Vignesh_Theyagarajan · June 29, 2020, 11:08am

Thanks for all the prompt responses @Mark_Harwood. Appreciate it.

Yea, the emails will be sorted based on the time received. And I haven't seen any duplication till now.

The sorting at the Email subject level needs to be based on the top most paragraph fetched (which again is sorted based on a couple of paragraph properties - Activity Recorded and Time).
Each Email subject will have a unique identifier based on which the grouping is performed.

I saw this. Can I get the overall distinct Email subject count with this (I currently use cardinality for getting the overall count)? Also can I get the number of matching paragraphs under each Email?

Vignesh_Theyagarajan · July 5, 2020, 4:40pm

Hi @Mark_Harwood, I tried using Field Collapse for my use case. It works per my requirement.
But got a different problem now. The query now takes on an average around 10s for execution.
This is for fetching the top 10 results of close to 3M matches (with 69K unique buckets).
Is there any way I can optimise this?
FYI, I'm not using any inner hits under field collapse.

Mark_Harwood · July 6, 2020, 8:27am

Hi Vignesh
One thing you could try is to profile your query

Vignesh_Theyagarajan · July 6, 2020, 8:49am

I ran the profiler @Mark_Harwood.
I could see that, the "advance" step for a boolean query was taking close to 5 seconds.

Vignesh_Theyagarajan · July 6, 2020, 9:03am

Here is my query just for your reference:

    {
      "aggs": {
        "my_count_agg": {
          "cardinality": {
            "field": "field1"
          }
        }
      },
      "collapse": {
        "field": "field1"
      },
      "from": 0,
      "highlight": {
        "encoder": "default",
        "fields": {
          "field2": {}
        }
      },
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "bool": {
                      "filter": [
                        {
                          "match": {
                            "field2": {
                              "query": "query_text",
                              "boost": 3
                            }
                          }
                        }
                      ]
                    }
                  },
                  {
                    "bool": {
                      "filter": [
                        {
                          "wildcard": {
                            "field2": {
                              "value": "query_text*",
                              "boost": 1
                            }
                          }
                        }
                      ]
                    }
                  },
                  {
                    "bool": {
                      "filter": [
                        {
                          "nested": {
                            "path": "path1",
                            "query": {
                              "bool": {
                                "should": [
                                  {
                                    "bool": {
                                      "filter": [
                                        {
                                          "match": {
                                            "field3": {
                                              "query": "field3_value",
                                              "boost": 9
                                            }
                                          }
                                        },
                                        {
                                          "match": {
                                            "field4": {
                                              "query": "query_text",
                                              "boost": 9
                                            }
                                          }
                                        }
                                      ]
                                    }
                                  },
                                  {
                                    "bool": {
                                      "filter": [
                                        {
                                          "wildcard": {
                                            "field3": {
                                              "value": "field3_value",
                                              "boost": 3
                                            }
                                          }
                                        },
                                        {
                                          "wildcard": {
                                            "field4": {
                                              "value": "query_text*",
                                              "boost": 3
                                            }
                                          }
                                        }
                                      ]
                                    }
                                  }
                                ]
                              }
                            }
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "field5": {
                        "value": "some_id",
                        "boost": 1
                      }
                    }
                  },
                  {
                    "term": {
                      "field6": {
                        "value": false,
                        "boost": 1
                      }
                    }
                  },
                  {
                    "term": {
                      "field7": {
                        "value": "some_id",
                        "boost": 1
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "size": 10,
      "sort": [
        {
          "field8": {
            "mode": "max",
            "nested": {
              "nested": {
                "filter": {
                  "bool": {
                    "should": [
                      {
                        "bool": {
                          "filter": [
                            {
                              "match": {
                                "field3": {
                                  "query": "field3_value",
                                  "boost": 9
                                }
                              }
                            },
                            {
                              "match": {
                                "field4": {
                                  "query": "query_text",
                                  "boost": 9
                                }
                              }
                            }
                          ]
                        }
                      },
                      {
                        "bool": {
                          "filter": [
                            {
                              "wildcard": {
                                "field3": {
                                  "value": "field3_value",
                                  "boost": 3
                                }
                              }
                            },
                            {
                              "wildcard": {
                                "field4": {
                                  "value": "query_text*",
                                  "boost": 3
                                }
                              }
                            }
                          ]
                        }
                      }
                    ]
                  }
                },
                "path": "path1"
              },
              "path": "path2"
            },
            "order": "desc"
          }
        },
        {
          "field9": {
            "order": "desc"
          }
        }
      ]
    }

Mark_Harwood · July 6, 2020, 9:35am

I'm not sure I can see a single smoking gun there but the way to find it would be by repeatedly simplifying the query until the response times change dramatically

Vignesh_Theyagarajan · July 6, 2020, 9:38am

Does the number of shards play a role here?
The volume of documents in my index is 180M (data of 60Gb+) and currently only one shard is used.

Mark_Harwood · July 6, 2020, 9:51am

Does the number of shards play a role here?

More shards gives you the ability to use more nodes which gives you more parallelism when servicing queries.

system · August 3, 2020, 9:51am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Sorting aggregation buckets on string field Elasticsearch	7	6725	July 6, 2017
Sort aggregation based on TopHits (ie top 10) average score Elasticsearch	2	695	May 25, 2021
Sort by top_hits aggregation Elastic Search elastic-app-search	1	149	March 18, 2024
How to sort the composite aggregation results by score? Elasticsearch	1	907	October 16, 2018
Composite Aggregation bucket sorting by max score Elasticsearch	6	2140	July 23, 2019

Sorting results from composite aggregation

Related topics