Accessing the value of a Cardinality Subaggregation in a Painless script to filter by the value

Problem: I want to use a Script Query in a Filter Aggregation to create a condition based on the values of subaggregations. I have no idea how to access the values of the subaggregations in a Painless script.

A few curl requests to create the data. The data represents items sold in a shop with the name of the shop and the name of the item stored, as well as a boolean flag as to whether the item is original to that shop.

PUT shop-item
PUT shop-item/_mapping/_doc
{
  "properties": {
    "isOriginalItem": {
      "type": "boolean"
    },
    "shopName": {
      "type": "keyword"
    },
    "itemName": {
      "type": "keyword"
    }
  }
}
POST _bulk
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem":true,"shopName":"Sainsburys","itemName":"Meat"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem":true,"shopName":"Lidl","itemName":"Meat"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : true,"shopName" : "Aldi","itemName" : "Meat"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : false,"shopName" : "Tesco","itemName" : "Meat"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : false,"shopName" : "Sainsburys","itemName" : "Cabbage"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : false,"shopName" : "Lidl","itemName" : "Cabbage"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : true,"shopName" : "Aldi","itemName" : "Cabbage"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : false,"shopName" : "Tesco","itemName" : "Cabbage"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : true,"shopName" : "Morrisons","itemName" : "Cat Food" }
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : false,"shopName" : "Lidl","itemName" : "Cat Food"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : true,"shopName" : "Co-op","itemName" : "Cat Food"}
{"index":{"_index":"shop-item","_type":"_doc"}}
{"isOriginalItem" : true,"shopName" : "Tesco","itemName" : "Cat Food"}

The following aggregation shows the number of unique shops and the total number of shops that an original item is sold in:

GET shop-item/_search
{
  "size": 0,
  "aggregations": {
    "original_items": {
      "filter": {
        "term": {
          "isOriginalItem": {
            "value": true
          }
        }
      },
      "aggs": {
        "group_by_item_name": {
          "terms": {
            "field": "itemName"
          },
          "aggs": {
            "number_of_unique_shops_sold_in": {
              "cardinality": {
                "field": "shopName"
              }
            },
            "total_number_of_shops_sold_in": {
              "value_count" : {
                "field": "shopName"
              }
            }
          }
        }
      }
    }
  }
}

Where I'm stuck is I want to only create buckets in group_by_item_name if the following condition is met:

number_of_unique_shops_sold_in.value > (0.95 * total_number_of_shops_sold_in)

In which case it will show the bucket aggregations for Cat Food and Cabbage

I've tried using Painless for this adding a filter with a Script Query but I have no idea how to access the values of subaggregations, Google has not yielded any results nor has the various combinations of things I've tried

If people are interested I can post some of the attempts I've made

I want to add a filter which only creates buckets in group_by_item_name aggregation if the value of the number_of_unique_shops_sold_in is met:

number_of_unique_shops_sold_in.value > (0.95 * total_number_of_shops_sold_in)

In which case it will show the bucket aggregations for Cat Food and Cabbage

I've tried using Painless for this adding a filter with a Script Query but I have no idea how to access the values of subaggregations, Google has not yielded any results nor has the various combinations of things I've tried

If people are interested I can post some of the (syntax error) attempts I've made

Maybe the example is over-simplified but you only look to have one record per unique item+shop name. In which case the doc count of an item == cardinality of shop names? In which case you can more simply trim buckets by min_doc_count?

Yep you are correct, the example is oversimplified sorry about that, even so I would still need to access the value of the number_of_unique_shops_sold_in an aggregation in the Painless script to compare it against the condition

Here is my attempt at the filtering, it won't compile because of the script but hopefully it's more clear what I'm trying to do

GET shop-item/_search
{
  "size": 0,
  "aggregations": {
    "original_items": {
      "filter": {
        "term": {
          "isOriginalItem": {
            "value": true
          }
        }
      },
      "aggs": {
        "group_by_item_name": {
          "terms": {
            "field": "itemName"
          },
          "aggs": {
            "filter_using_shop_threshold": {
              "filter": {
                "script": {
                  "script": {
                    "source": "return number_of_unique_shops_sold_in._value > 0.95 * total_number_of_shops_sold_in._value;"
                  }
                }
              },
              "aggs": {
                "number_of_unique_shops_sold_in": {
                  "cardinality": {
                    "field": "shopName"
                  }
                },
                "total_number_of_shops_sold_in": {
                  "value_count" : {
                    "field": "shopName"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

This example script has a misconception about the time and the place in which the script runs.

Scripts inside the aggregation tree are given a single document to evaluate on a single shard. The scripts can inspect the fields of that doc but that's the scope of the context it works in. It works in the data-gathering phase on data nodes.

In contrast, the "cardinality" value is a final result value only available after all matching docs have been examined on all shards and returned to a central coordinating node.

To perform any reasoning about final aggregation results on the coordinating node you need to use a pipeline aggregation such as thebucket selector

2 Likes

Thank you so much :slight_smile:

Here's the solution to my question:

GET shop-item/_search
{
  "size": 0,
  "aggregations": {
	"original_items": {
	  "filter": {
		"term": {
		  "isOriginalItem": {
			"value": true
		  }
		}
	  },
	  "aggs": {
		"group_by_item_name": {
		  "terms": {
			"field": "itemName"
		  },
		  "aggs": {
			"number_of_unique_shops_sold_in": {
			  "cardinality": {
				"field": "shopName"
			  }
			},
			"total_number_of_shops_sold_in": {
			  "value_count" : {
				"field": "shopName"
			  }
			},
			"bucket_filter": {
			  "bucket_selector": {
				"buckets_path": {
				  "uniqueShops": "number_of_unique_shops_sold_in",
				  "totalShops": "total_number_of_shops_sold_in"
				},
				"script": "params.uniqueShops > 0.95 * params.totalShops"
			  }
			}
		  }
		}
	  }
	}
  }
}
1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.