Terms aggregation: filter buckets based on parent bucket

vincetrumental · November 13, 2018, 5:00pm

Hello,
We are dealing with categorized documents whose category tree is modeled with flattened fields. The top-level category is:
product.category.code (example values: R01, R02, R12...)
The subcategories and sub-subcategories are:
product.category.category.code (example values: R01F01, R02F01, R12F04...)
and product.category.category.category.code (example values: R01F01SF01, R02F03SF02, R12F02SF04...)
A single document can have multiple values for any of the category-level fields.

With the following query, we have duplicate subcategory buckets:

{
  "size": 0,
  "aggs": {
    "Topcategories": {
      "terms": {
        "field": "product.categories.code",
        "size": 2147483647,
        "order": {
          "_term": "asc"
        }
      },
      "aggs": {
        "subcat_1": {
          "terms": {
            "field": "product.categories.categories.code",
            "size": 2147483647,
            "order": {
              "_term": "asc"
            }
          },
          "aggs": {
            "subcat_2": {
              "terms": {
                "field": "product.categories.categories.categories.code",
                "size": 2147483647,
                "order": {
                  "_term": "asc"
                }
              }
            }
          }
        }
      }
    }
  }
}

This happens because, if a document is in categories R01, R01F01, R02 and R02F04, then R02F04 appears as a sub-aggregation bucket of R01 as well as of R02, which is not what we want. We have "lost" the business information that R02F04 is a subnode of R02 only.
We would therefore like to filter within each sub-aggregation so that we retain only the term buckets whose key starts with the key of the corresponding parent bucket.
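
To make the wanted filter concrete, here is a minimal client-side sketch in Python. It is only an illustration: the function name `prune_by_prefix` and the sample response are hypothetical, and it assumes that every child code starts with its parent's code, as in the example values above. It walks the nested buckets returned by the query and drops any child bucket whose key does not start with its parent's key.

```python
def prune_by_prefix(buckets, child_agg_names):
    """Return a tree keeping only child buckets whose key starts with the parent key.

    buckets: the "buckets" list of a terms aggregation response.
    child_agg_names: names of the nested sub-aggregations, outermost first.
    """
    pruned = []
    for bucket in buckets:
        node = {"key": bucket["key"], "doc_count": bucket["doc_count"], "children": []}
        if child_agg_names:
            name, deeper = child_agg_names[0], child_agg_names[1:]
            # Keep only the sub-buckets that really belong under this parent.
            matching = [
                child
                for child in bucket.get(name, {}).get("buckets", [])
                if child["key"].startswith(bucket["key"])
            ]
            node["children"] = prune_by_prefix(matching, deeper)
        pruned.append(node)
    return pruned


# Hypothetical response for one document in R01, R01F01, R02 and R02F04.
sample_response = {
    "aggregations": {
        "Topcategories": {
            "buckets": [
                {"key": "R01", "doc_count": 1, "subcat_1": {"buckets": [
                    {"key": "R01F01", "doc_count": 1, "subcat_2": {"buckets": []}},
                    {"key": "R02F04", "doc_count": 1, "subcat_2": {"buckets": []}},
                ]}},
                {"key": "R02", "doc_count": 1, "subcat_1": {"buckets": [
                    {"key": "R01F01", "doc_count": 1, "subcat_2": {"buckets": []}},
                    {"key": "R02F04", "doc_count": 1, "subcat_2": {"buckets": []}},
                ]}},
            ]
        }
    }
}

pruned_tree = prune_by_prefix(
    sample_response["aggregations"]["Topcategories"]["buckets"],
    ["subcat_1", "subcat_2"],
)
```

In this sample, R02F04 is kept only under R02 and R01F01 only under R01. The filtering still happens after Elasticsearch has computed every combination, so it does not reduce the cost of the query itself.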
Is this possible in any way? Or is it a modeling problem, and if so, what improvement would you suggest?
For now we can clean up the tree structure returned by the aggs query, but this is costly both in Elasticsearch and in our application because of the combinatorics (we get 90K nodes, whereas the "real", filtered structure would have only 700 nodes).
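
One alternative we could imagine, not verified against our data, is to avoid nesting altogether: run the three terms aggregations side by side and rebuild the tree in the application from the code prefixes. The sketch below is an assumption-laden illustration: the aggregation names `level_1` to `level_3`, the `size` of 1000 and the helper `build_tree` are hypothetical, the field names are copied from the query above, and it relies on every code starting with its parent's code.

```python
# Query body with three sibling (non-nested) terms aggregations.
# The size of 1000 is an assumed bound, chosen to exceed the ~700 real nodes.
FLAT_QUERY = {
    "size": 0,
    "aggs": {
        "level_1": {"terms": {"field": "product.categories.code", "size": 1000}},
        "level_2": {"terms": {"field": "product.categories.categories.code", "size": 1000}},
        "level_3": {
            "terms": {"field": "product.categories.categories.categories.code", "size": 1000}
        },
    },
}


def build_tree(aggregations, levels=("level_1", "level_2", "level_3")):
    """Rebuild the category tree by attaching each bucket to its prefix parent."""
    roots = []
    previous = {}  # key -> node for the level processed just before
    for depth, level in enumerate(levels):
        current = {}
        for bucket in sorted(aggregations[level]["buckets"], key=lambda b: b["key"]):
            node = {"key": bucket["key"], "doc_count": bucket["doc_count"], "children": []}
            current[bucket["key"]] = node
            if depth == 0:
                roots.append(node)
                continue
            # Pick the longest parent key that is a prefix of this key.
            parent_key = max(
                (k for k in previous if bucket["key"].startswith(k)),
                key=len,
                default=None,
            )
            if parent_key is not None:
                previous[parent_key]["children"].append(node)
        previous = current
    return roots


# Hypothetical "aggregations" section of a response to FLAT_QUERY.
sample_aggregations = {
    "level_1": {"buckets": [{"key": "R02", "doc_count": 1}, {"key": "R01", "doc_count": 2}]},
    "level_2": {"buckets": [{"key": "R02F04", "doc_count": 1}, {"key": "R01F01", "doc_count": 1}]},
    "level_3": {"buckets": [{"key": "R01F01SF01", "doc_count": 1}]},
}

flat_tree = build_tree(sample_aggregations)
```

With this shape, the number of buckets returned is the sum of the distinct codes per level rather than their product. A sub-bucket's `doc_count` is then the total for that code across all documents, which matches the nested result only if every document in a subcategory also carries its parent category.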
Thanks for any help or suggestions!
