Aggregations based on filtered nested fields

willcosgrove · September 5, 2017, 2:05pm

I'm trying to construct a query with some aggregations based on filters of nested fields. Let me briefly explain my data models so the examples are easier to follow.

I've got a Winery model, which has many Wine models. The Wines that belong to a Winery are Wines that the Winery produced. I'm going to simplify my models for the sake of a simpler example. Lets assume there are no attributes of a Winery, other than an a unique ID. And lets assume that the only attributes of a Wine are a price, and a score (representing a critic's score).

When we search over Wineries, I want to return aggregations that tell me how many wineries match for each given price range and each given score range.

Here's an example of a search that we use, that is not returning the information I'm wanting.

GET wineries_development/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "from": 0,
  "timeout": "11s",
  "_source": false,
  "post_filter": {
    "nested": {
      "path": "wines",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "wines.scores": {
                  "gte": 100
                }
              }
            },
            {
              "range": {
                "wines.price": {
                  "lt": 3100
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "wines.price": {
      "filter": {
        "nested": {
          "path": "wines",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "wines.scores": {
                      "gte": 100
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "price": {
          "range": {
            "field": "wines.price",
            "keyed": true,
            "ranges": [
              {
                "to": 3100,
                "key": "<30"
              },
              {
                "from": 3100,
                "to": 5100,
                "key": "31-51"
              },
              {
                "from": 5100,
                "to": 10100,
                "key": "51-101"
              },
              {
                "from": 10100,
                "to": 25100,
                "key": "101-251"
              },
              {
                "from": 25100,
                "key": "251+"
              }
            ]
          }
        }
      }
    },
    "wines.scores": {
      "filter": {
        "nested": {
          "path": "wines",
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "wines.price": {
                      "lt": 3100
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "aggs": {
        "scores": {
          "range": {
            "field": "wines.scores",
            "keyed": true,
            "ranges": [
              {
                "from": 100,
                "key": "100"
              },
              {
                "from": 97,
                "to": 100,
                "key": "97-99"
              },
              {
                "from": 94,
                "to": 97,
                "key": "94-96"
              },
              {
                "from": 91,
                "to": 94,
                "key": "91-93"
              },
              {
                "to": 91,
                "key": "<90"
              }
            ]
          }
        }
      }
    }
  }
}

Which is returning this:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "wines.scores": {
      "doc_count": 741,
      "scores": {
        "buckets": {
          "100": {
            "from": 100,
            "doc_count": 2
          },
          "<90": {
            "to": 91,
            "doc_count": 142
          },
          "91-93": {
            "from": 91,
            "to": 94,
            "doc_count": 179
          },
          "94-96": {
            "from": 94,
            "to": 97,
            "doc_count": 72
          },
          "97-99": {
            "from": 97,
            "to": 100,
            "doc_count": 18
          }
        }
      }
    },
    "wines.price": {
      "doc_count": 21,
      "price": {
        "buckets": {
          "<30": {
            "to": 3100,
            "doc_count": 2
          },
          "31-51": {
            "from": 3100,
            "to": 5100,
            "doc_count": 4
          },
          "51-101": {
            "from": 5100,
            "to": 10100,
            "doc_count": 10
          },
          "101-251": {
            "from": 10100,
            "to": 25100,
            "doc_count": 12
          },
          "251+": {
            "from": 25100,
            "doc_count": 19
          }
        }
      }
    }
  }
}

So this query says: "Get me all the wineries that have a wine that is both under $31 and has a score of 100". No such wines exist, and so no wineries should be returned. But if we look at the aggregation counts for the "100" score bucket, or the "<31" price bucket, we see that it gives a count of 2.

I understand why it's giving a count of 2. There are Wineries that exist that have a wine that is scored 100, and have a different wine that is priced under $31. Two separate wines.

The nested query that actually gets the results in the post_filter is doing the correct thing. Looking for a winery that has a single wine that matches the given criteria. But the aggregations have a filter aggregation that does the filter on the other field, and then has a range sub-aggregation which no longer factors in the filtered out wines in the filter aggregation.

I know that this is not a bug, I'm just not doing the right thing or it's not possible to do in the way that I want.

If anyone has any ideas on how to achieve this kind of aggregation I would love to hear about it!

Thanks for taking a look!

jpountz · September 6, 2017, 4:55pm

I believe you are after something like that:

GET wineries_development/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "from": 0,
  "timeout": "11s",
  "_source": false,
  "post_filter": {
    "nested": {
      "path": "wines",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "wines.scores": {
                  "gte": 100
                }
              }
            },
            {
              "range": {
                "wines.price": {
                  "lt": 3100
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggregations": {
    "wines": {
      "nested": {
        "path": "wines"
      },
      "aggregations": {
        "wines.price": {
          "filter": {
            "range": {
              "wines.scores": {
                "gte": 100
              }
            }
          },
          "aggs": {
            "price": {
              "range": {
                "field": "wines.price",
                "keyed": true,
                "ranges": [
                  {
                    "to": 3100,
                    "key": "&lt;30"
                  },
                  {
                    "from": 3100,
                    "to": 5100,
                    "key": "31-51"
                  },
                  {
                    "from": 5100,
                    "to": 10100,
                    "key": "51-101"
                  },
                  {
                    "from": 10100,
                    "to": 25100,
                    "key": "101-251"
                  },
                  {
                    "from": 25100,
                    "key": "251+"
                  }
                ]
              }
            }
          }
        }
      },
      "wines.scores": {
        "filter": {
          "range": {
            "wines.price": {
              "lt": 3100
            }
          }
        },
        "aggs": {
          "scores": {
            "range": {
              "field": "wines.scores",
              "keyed": true,
              "ranges": [
                {
                  "from": 100,
                  "key": "100"
                },
                {
                  "from": 97,
                  "to": 100,
                  "key": "97-99"
                },
                {
                  "from": 94,
                  "to": 97,
                  "key": "94-96"
                },
                {
                  "from": 91,
                  "to": 94,
                  "key": "91-93"
                },
                {
                  "to": 91,
                  "key": "&lt;90"
                }
              ]
            }
          }
        }
      }
    }
  }
}

Putting the nested aggregation at the top makes your aggregation work at the nested level rather than the top level.

willcosgrove · September 6, 2017, 7:39pm

Oh, interesting! Ok, so does that then mean that the counts that I'm getting back are now counts of the nested documents? IE: The aggregation values are now counts of wines rather than wineries?

jpountz · September 6, 2017, 7:55pm

Yes. You can use the reverse_nested aggregation if you want wineries counts back.

willcosgrove · September 6, 2017, 7:58pm

Out of curiosity, say I had winery_id (the id of the parent document) as a field of the nested wine document. Do you know off the top of your head if it would be more performant to do a cardinality aggregation on that winery_id field, or the reverse_nested?

And thank you so so much for your help. I have been stuck on this problem for about a month.

jpountz · September 6, 2017, 8:24pm

reverse_nested will be more efficient.

willcosgrove · September 6, 2017, 8:24pm

Awesome, thank you again so much for your help!

jpountz · September 6, 2017, 8:27pm

I'm not sure of the scope of your project, but if there ever is a website that uses this functionality to expose wine search, please share the link.

willcosgrove · September 6, 2017, 8:30pm

Definitely! We're working on a major rewrite of the front end UI and searching capability. We're shooting to have it wrapped up in another month or two. I will post a link back here when it's up and live!

system · October 4, 2017, 8:30pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Filter nested field in aggregation if value isn't indexed or is null Elasticsearch	1	296	November 24, 2023
Aggregating on nested fields Elasticsearch	5	380	July 6, 2017
Elasticsearch aggregation doesn't work with nested-type fields Elasticsearch	4	1505	July 5, 2017
How to perform filter based on aggregation of nested fields? (similar to nested sorting) Elasticsearch	1	422	February 28, 2020
Filtering aggregation results based on nested objects in ElasticSearch Elasticsearch	1	144	December 16, 2022

Aggregations based on filtered nested fields

Related topics