Aggregation on a materialized path


(Deviantony) #1

Hello there!

I'm trying to model a collection of products in Elasticsearch, each product having a label and one or more categories.

Example:

{
  "label": "Galaxy S4",
  "categoryPath": ["/Smartphone/Android/5.1"]
}

{
  "label": "Galaxy S6",
  "categoryPath": ["/Smartphone/Android/6.0"]
}

{
  "label": "Iphone 6s",
  "categoryPath": ["/Smartphone/IOS"]
}

And the category tree for this example:

| /
| / Smartphone
| / Smartphone / Android
| / Smartphone / Android / 5.1
| / Smartphone / Android / 6.0
| / Smartphone / IOS

What I would like to do is retrieve the number of products per category level, e.g. how many products are located in the "Smartphone" category? I expect it to return buckets for the direct child categories only (here two: Android and IOS).

I'm using the path_hierarchy tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html) on the categoryPath field and a terms aggregation to count products.
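To make the tokenizer's effect concrete, here is a minimal Python sketch that emulates what path_hierarchy emits with its default "/" delimiter: one token per path prefix. The function name is hypothetical, and the sketch assumes the stored paths start with a leading slash, as in the category tree above.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emulate the default path_hierarchy tokenizer: one token per path prefix.

    Hypothetical helper for illustration only; it is not part of Elasticsearch.
    """
    tokens = []
    # Start searching at index 1 so a leading delimiter does not yield an empty token.
    end = path.find(delimiter, 1)
    while end != -1:
        tokens.append(path[:end])
        end = path.find(delimiter, end + 1)
    tokens.append(path)  # the full path is always the last token
    return tokens


print(path_hierarchy_tokens("/Smartphone/Android/5.1"))
```

Because every ancestor path becomes a term of the document, a terms aggregation on the tokenized field counts a product once for each level of its category path.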

I've seen that I could use the include/exclude parameters to filter category levels using regexp: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2

So, for example, I could request how many products are located in each direct child of the "Smartphone" category with:

GET my_index/product/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "categoryPath.tokenized": "/Smartphone"
        }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "categoryPath.tokenized",
        "size": 0,
        "include": "\/Smartphone\/.*",
        "exclude": "\/Smartphone\/.*\/.*"
      }
    }
  }
}
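As a sanity check of the include/exclude logic, the following Python sketch emulates the request above in memory: it keeps only documents that carry the parent term, then counts the terms that match the include pattern but not the exclude pattern. All names are hypothetical, the paths are assumed to start with a leading slash, and the patterns are applied to the whole term, which is how I understand the aggregation's regular expressions to behave. It says nothing about how Elasticsearch evaluates these expressions internally, so it does not answer the performance question.

```python
import re
from collections import Counter


def path_prefixes(path, delimiter="/"):
    """Hypothetical stand-in for the path_hierarchy tokenizer (one token per prefix)."""
    tokens = []
    end = path.find(delimiter, 1)  # skip a leading delimiter
    while end != -1:
        tokens.append(path[:end])
        end = path.find(delimiter, end + 1)
    tokens.append(path)
    return tokens


def direct_child_counts(paths, parent):
    """Count documents per direct child category of `parent`."""
    include = re.compile(re.escape(parent) + "/.*")      # any descendant
    exclude = re.compile(re.escape(parent) + "/.*/.*")   # grandchildren and deeper
    counts = Counter()
    for path in paths:
        tokens = set(path_prefixes(path))
        if parent not in tokens:  # mirrors the term filter on the parent
            continue
        for token in tokens:
            # Whole-term match, like the anchored patterns in the request
            if include.fullmatch(token) and not exclude.fullmatch(token):
                counts[token] += 1
    return dict(counts)


# One categoryPath per example product from the post
PRODUCT_PATHS = [
    "/Smartphone/Android/5.1",
    "/Smartphone/Android/6.0",
    "/Smartphone/IOS",
]

print(direct_child_counts(PRODUCT_PATHS, "/Smartphone"))
```

For the three example products this yields one bucket for /Smartphone/Android with two documents and one for /Smartphone/IOS with one document, which is the result I am after.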

But I'm curious about the performance impact(s) here, as I'm expecting to store 5M+ products and a lot of categories.

So, is this the only way to achieve what I'm looking for? Should I review my model?

FYI, I'm using the following configuration/mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "label": {
          "type": "string",
          "analyzer": "english"
        },
        "categoryPath": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true,
          "fields": {
            "tokenized": {
              "type": "string",
              "analyzer": "path_analyzer"
            }
          }
        }
      }
    }
  }
}

Thanks in advance for your answers!


(Ivan Brusic) #2

Two weeks ago the ES team closed a year-old issue requesting support for such a feature, but someone has created a plugin that supports it. I have never used it.


Ivan


(Deviantony) #3

OK, it seems that there is no native way to achieve this kind of aggregation.

Is there an alternative model or query that would produce this aggregation?

