Aggregations based on filtered nested fields


(Will Cosgrove) #1

I'm trying to construct a query with some aggregations based on filters of nested fields. Let me briefly explain my data models so the examples are easier to follow.

I've got a Winery model, which has many Wine models. The Wines that belong to a Winery are Wines that the Winery produced. I'm going to simplify my models for the sake of a simpler example. Lets assume there are no attributes of a Winery, other than an a unique ID. And lets assume that the only attributes of a Wine are a price, and a score (representing a critic's score).

When we search over Wineries, I want to return aggregations that tell me how many wineries match for each given price range and each given score range.

Here's an example of a search that we use, that is not returning the information I'm wanting.

GET wineries_development/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "from": 0,
  "timeout": "11s",
  "_source": false,
  "post_filter": {
    "nested": {
      "path": "wines",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "wines.scores": {
                  "gte": 100
                }
              }
            },
            {
              "range": {
                "wines.price": {
                  "lt": 3100
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "wines.price": {
      "filter": {
        "nested": {
          "path": "wines",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "wines.scores": {
                      "gte": 100
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "price": {
          "range": {
            "field": "wines.price",
            "keyed": true,
            "ranges": [
              {
                "to": 3100,
                "key": "<30"
              },
              {
                "from": 3100,
                "to": 5100,
                "key": "31-51"
              },
              {
                "from": 5100,
                "to": 10100,
                "key": "51-101"
              },
              {
                "from": 10100,
                "to": 25100,
                "key": "101-251"
              },
              {
                "from": 25100,
                "key": "251+"
              }
            ]
          }
        }
      }
    },
    "wines.scores": {
      "filter": {
        "nested": {
          "path": "wines",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "wines.price": {
                      "lt": 3100
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "scores": {
          "range": {
            "field": "wines.scores",
            "keyed": true,
            "ranges": [
              {
                "from": 100,
                "key": "100"
              },
              {
                "from": 97,
                "to": 100,
                "key": "97-99"
              },
              {
                "from": 94,
                "to": 97,
                "key": "94-96"
              },
              {
                "from": 91,
                "to": 94,
                "key": "91-93"
              },
              {
                "to": 91,
                "key": "<90"
              }
            ]
          }
        }
      }
    }
  }
}

Which is returning this:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "wines.scores": {
      "doc_count": 741,
      "scores": {
        "buckets": {
          "100": {
            "from": 100,
            "doc_count": 2
          },
          "<90": {
            "to": 91,
            "doc_count": 142
          },
          "91-93": {
            "from": 91,
            "to": 94,
            "doc_count": 179
          },
          "94-96": {
            "from": 94,
            "to": 97,
            "doc_count": 72
          },
          "97-99": {
            "from": 97,
            "to": 100,
            "doc_count": 18
          }
        }
      }
    },
    "wines.price": {
      "doc_count": 21,
      "price": {
        "buckets": {
          "<30": {
            "to": 3100,
            "doc_count": 2
          },
          "31-51": {
            "from": 3100,
            "to": 5100,
            "doc_count": 4
          },
          "51-101": {
            "from": 5100,
            "to": 10100,
            "doc_count": 10
          },
          "101-251": {
            "from": 10100,
            "to": 25100,
            "doc_count": 12
          },
          "251+": {
            "from": 25100,
            "doc_count": 19
          }
        }
      }
    }
  }
}

So this query says: "Get me all the wineries that have a wine that is both under $31 and has a score of 100". No such wines exist, and so no wineries should be returned. But if we look at the aggregation counts for the "100" score bucket, or the "<31" price bucket, we see that it gives a count of 2.

I understand why it's giving a count of 2. There are Wineries that exist that have a wine that is scored 100, and have a different wine that is priced under $31. Two separate wines.

The nested query that actually gets the results in the post_filter is doing the correct thing. Looking for a winery that has a single wine that matches the given criteria. But the aggregations have a filter aggregation that does the filter on the other field, and then has a range sub-aggregation which no longer factors in the filtered out wines in the filter aggregation.

I know that this is not a bug, I'm just not doing the right thing or it's not possible to do in the way that I want.

If anyone has any ideas on how to achieve this kind of aggregation I would love to hear about it!

Thanks for taking a look!


(Adrien Grand) #2

I believe you are after something like that:

GET wineries_development/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "from": 0,
  "timeout": "11s",
  "_source": false,
  "post_filter": {
    "nested": {
      "path": "wines",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "wines.scores": {
                  "gte": 100
                }
              }
            },
            {
              "range": {
                "wines.price": {
                  "lt": 3100
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "wines": {
      "nested": {
        "path": "wines"
      },
      "aggregations": {
        "wines.price": {
          "filter": {
            "range": {
              "wines.scores": {
                "gte": 100
              }
            }
          },
          "aggs": {
            "price": {
              "range": {
                "field": "wines.price",
                "keyed": true,
                "ranges": [
                  {
                    "to": 3100,
                    "key": "&lt;30"
                  },
                  {
                    "from": 3100,
                    "to": 5100,
                    "key": "31-51"
                  },
                  {
                    "from": 5100,
                    "to": 10100,
                    "key": "51-101"
                  },
                  {
                    "from": 10100,
                    "to": 25100,
                    "key": "101-251"
                  },
                  {
                    "from": 25100,
                    "key": "251+"
                  }
                ]
              }
            }
          }
        }
      },
      "wines.scores": {
        "filter": {
          "range": {
            "wines.price": {
              "lt": 3100
            }
          }
        },
        "aggs": {
          "scores": {
            "range": {
              "field": "wines.scores",
              "keyed": true,
              "ranges": [
                {
                  "from": 100,
                  "key": "100"
                },
                {
                  "from": 97,
                  "to": 100,
                  "key": "97-99"
                },
                {
                  "from": 94,
                  "to": 97,
                  "key": "94-96"
                },
                {
                  "from": 91,
                  "to": 94,
                  "key": "91-93"
                },
                {
                  "to": 91,
                  "key": "&lt;90"
                }
              ]
            }
          }
        }
      }
    }
  }
}

Putting the nested aggregation at the top makes your aggregation work at the nested level rather than the top level.


(Will Cosgrove) #3

Oh, interesting! Ok, so does that then mean that the counts that I'm getting back are now counts of the nested documents? IE: The aggregation values are now counts of wines rather than wineries?


(Adrien Grand) #4

Yes. You can use the reverse_nested aggregation if you want wineries counts back.


(Will Cosgrove) #5

Out of curiosity, say I had winery_id (the id of the parent document) as a field of the nested wine document. Do you know off the top of your head if it would be more performant to do a cardinality aggregation on that winery_id field, or the reverse_nested?

And thank you so so much for your help. I have been stuck on this problem for about a month.


(Adrien Grand) #6

reverse_nested will be more efficient.


(Will Cosgrove) #7

Awesome, thank you again so much for your help!


(Adrien Grand) #8

I'm not sure of the scope of your project, but if there ever is a website that uses this functionality to expose wine search, please share the link. :slight_smile:


(Will Cosgrove) #9

Definitely! We're working on a major rewrite of the front end UI and searching capability. We're shooting to have it wrapped up in another month or two. I will post a link back here when it's up and live!


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.