Hello there!
I'm trying to model a collection of products in Elasticsearch, each product having a label and one or more categories.
Example:
{
  "label": "Galaxy S4",
  "categoryPath": ["/Smartphone/Android/5.1"]
}
{
  "label": "Galaxy S6",
  "categoryPath": ["/Smartphone/Android/6.0"]
}
{
  "label": "Iphone 6s",
  "categoryPath": ["/Smartphone/IOS"]
}
And the category tree for this example:
| /
| / Smartphone
| / Smartphone / Android
| / Smartphone / Android / 5.1
| / Smartphone / Android / 6.0
| / Smartphone / IOS
What I would like to do is retrieve the number of products per category level, e.g. how many products are located in the "Smartphone" category? I expect it to return two buckets, one for each direct child category only (Android and IOS).
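To make the expected result concrete, here is a minimal Python sketch of the counting I have in mind, done in plain code rather than in Elasticsearch. The function name `count_direct_children` is made up for this illustration, and the paths are assumed to start with a leading slash, as in the category tree.

```python
from collections import Counter

# Example documents; the leading slash on each path is an assumption
# made so that the paths line up with the category tree.
products = [
    {"label": "Galaxy S4", "categoryPath": ["/Smartphone/Android/5.1"]},
    {"label": "Galaxy S6", "categoryPath": ["/Smartphone/Android/6.0"]},
    {"label": "Iphone 6s", "categoryPath": ["/Smartphone/IOS"]},
]


def count_direct_children(products, parent):
    """Count products under each direct child category of `parent`.

    Hypothetical helper: it only illustrates the desired buckets.
    """
    prefix = parent.rstrip("/") + "/"
    counts = Counter()
    for product in products:
        children = set()  # count each product once per child category
        for path in product["categoryPath"]:
            if path.startswith(prefix):
                # Keep only the first segment below the parent.
                child = path[len(prefix):].split("/", 1)[0]
                if child:
                    children.add(prefix + child)
        counts.update(children)
    return dict(counts)


expected = count_direct_children(products, "/Smartphone")
print(expected)
```

For the three example products this gives one bucket for Android with two products and one bucket for IOS with one product, which is what I would like the aggregation to return.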
I'm using the path_hierarchy tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html) on a sub-field of categoryPath (categoryPath.tokenized) and a terms aggregation to count products.
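As far as I understand it, the tokenizer emits one token per path prefix. The following Python sketch is a simplified emulation of that behaviour with the default "/" delimiter, not the actual Lucene implementation; `path_hierarchy_tokens` is a name I made up, and edge cases such as trailing slashes are not handled faithfully.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Simplified emulation of path_hierarchy: one token per path prefix."""
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty prefix produced by a leading delimiter
            tokens.append(token)
    return tokens


for token in path_hierarchy_tokens("/Smartphone/Android/5.1"):
    print(token)
```

So every product is indexed under its own category and under each ancestor category, which is why a term filter on an ancestor path matches it.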
I've seen that I can use the include/exclude parameters to filter category levels with regular expressions: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2
So, for example, I could count the products in each direct child of the "Smartphone" category with:
GET my_index/product/_search
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "categoryPath.tokenized": "/Smartphone"
        }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "categoryPath.tokenized",
        "size": 0,
        "include": "\/Smartphone\/.*",
        "exclude": "\/Smartphone\/.*\/.*"
      }
    }
  }
}
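To check my understanding of how include and exclude combine, here is a small Python sketch that applies the same two patterns to the tokens of the example products. It only emulates the request: Python's `re.fullmatch` stands in for Elasticsearch's anchored regular expressions, the helper `prefixes` is hypothetical, and the paths are assumed to start with a leading slash.

```python
import re
from collections import Counter


def prefixes(path):
    """Path prefixes, roughly what path_hierarchy would emit (simplified)."""
    parts = path.split("/")
    candidates = ("/".join(parts[:i]) for i in range(1, len(parts) + 1))
    return [token for token in candidates if token]


# Category paths of the three example products (leading slash assumed).
paths = [
    "/Smartphone/Android/5.1",
    "/Smartphone/Android/6.0",
    "/Smartphone/IOS",
]

include = re.compile(r"/Smartphone/.*")      # everything below /Smartphone
exclude = re.compile(r"/Smartphone/.*/.*")   # drop grandchildren and deeper

buckets = Counter()
for path in paths:
    tokens = set(prefixes(path))
    if "/Smartphone" not in tokens:
        continue  # plays the role of the term filter
    for token in tokens:
        # Patterns must match the whole term, hence fullmatch.
        if include.fullmatch(token) and not exclude.fullmatch(token):
            buckets[token] += 1

print(dict(buckets))
```

With the example data only the two direct children survive both patterns: the Android bucket with two products and the IOS bucket with one.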
But I'm concerned about the performance impact of this approach, as I expect to store 5M+ products and a lot of categories.
So, is this the only way to achieve what I'm looking for? Should I review my model?
FYI, I'm using the following configuration/mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "label": {
          "type": "string",
          "analyzer": "english"
        },
        "categoryPath": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true,
          "fields": {
            "tokenized": {
              "type": "string",
              "analyzer": "path_analyzer"
            }
          }
        }
      }
    }
  }
}
Thanks in advance for your answers!