I'm trying to implement what's called a managed vocabulary (an extension of a taxonomy that also accounts for synonyms), based on the ideas presented in the article "Patterns for Elasticsearch Synonyms: Taxonomies and Managed Vocabularies", and I ran into some issues with how the terms are classified and what the query returns.
Here is an example that illustrates the problem:
Assuming I have the following taxonomy:
Computer (has synonyms: Ordinateur...)
└── Laptop (has synonyms: PC_Protable...)
    └── Mini (has synonyms: Mini_Laptop...)
What I want is this: if the user searches for computer, the results should first list all descriptions that contain the word computer or its synonym Ordinateur, then the descriptions that contain Laptop, and so on until the end of the tree is reached (in this case Mini). Here's what I've done:
I indexed the data with and without synonyms as @abdon suggested in this answer:
PUT taxon_test
{
  "mappings": {
    "tech": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "taxonomy_text"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "vocab_taxonomy": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [
            "computer, ordinateur => computer, laptop, mini",
            "laptop, pc_protable => laptop, mini",
            "mini, mini_laptop => mini"
          ]
        }
      },
      "analyzer": {
        "taxonomy_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop", "vocab_taxonomy"]
        }
      }
    }
  }
}
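As a rough illustration of what the explicit mappings in vocab_taxonomy do to individual tokens, the following Python sketch mimics the rewriting outside Elasticsearch. It is a simplification under stated assumptions: only single-token rules are handled, token positions are ignored, and the names RULES and expand are made up for this illustration rather than taken from any Elasticsearch API.

```python
# Hypothetical stand-in for the three "=>" rules in vocab_taxonomy:
# every term on the left-hand side is replaced by the full right-hand list.
RULES = {
    "computer": ["computer", "laptop", "mini"],
    "ordinateur": ["computer", "laptop", "mini"],
    "laptop": ["laptop", "mini"],
    "pc_protable": ["laptop", "mini"],
    "mini": ["mini"],
    "mini_laptop": ["mini"],
}


def expand(tokens):
    """Rewrite each token using RULES; tokens without a rule pass through."""
    out = []
    for token in tokens:
        out.extend(RULES.get(token, [token]))
    return out


print(expand(["ordinateur"]))             # ['computer', 'laptop', 'mini']
print(expand(["pc_protable"]))            # ['laptop', 'mini']
print(expand(["modern", "mini_laptop"]))  # ['modern', 'mini']
```

Because the mapping declares a single analyzer for description.synonyms, this same rewriting would presumably apply both to the indexed descriptions and to the query text.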
And filled the index:
PUT taxon_test/tech/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Modern computer has the ability to follow generalized sets of operations." }
{ "index" : { "_id" : "2" } }
{ "description": "Modern computer is very different from early computer."}
{ "index" : { "_id" : "3" } }
{ "description": "Dell's XPS 13 remains one of the best all-around 13-inch laptops." }
{ "index" : { "_id" : "4" } }
{ "description": "Find a great collection of laptop at HP." }
{ "index" : { "_id" : "5" } }
{ "description": "Mini Samsung Chromebook 3 has a good design." }
{ "index" : { "_id" : "6" } }
{ "description": "Dell Latitude is a mini too." }
{ "index" : { "_id" : "7" } }
{ "description": "Ordinateur is in french." }
{ "index" : { "_id" : "8" } }
{ "description": "Find the laptop to suit your needs when you shop." }
When looking for computer using the following query:
GET taxon_test/tech/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "computer"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": "computer"
          }
        }
      ]
    }
  }
}
I get all the results sorted (all descriptions that contain computer, then laptop, then mini), exactly as I wanted:
"Modern computer is very different from early computer." (score: 2.1292622)
"Modern computer has the ability to follow generalized sets of operations." (score: 1.4279017)
"Ordinateur is in french." (score: 0.34066662)
"Find a great collection of laptop at HP." (score: 0.27848446)
"Find the laptop to suit your needs when you shop." (score: 0.24843818)
"Dell Latitude is a mini too." (score: 0.22731867)
"Mini Samsung Chromebook 3 has a good design." (score: 0.18983713)
But when I use the same query to look for laptop, I also get the computer results. Here is the result (what I wanted was only the laptop results first, then the mini ones):
"Find a great collection of laptop at HP." (score: 1.591003)
"Find the laptop to suit your needs when you shop." (score: 1.4431245)
"Ordinateur is in french." (score: 0.31679824)
"Modern computer is very different from early computer." (score: 0.31380016)
"Modern computer has the ability to follow generalized sets of operations." (score: 0.24843818)
"Dell Latitude is a mini too." (score: 0.22731867)
"Mini Samsung Chromebook 3 has a good design." (score: 0.18983713)
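The result above is consistent with the same expansion being applied on both sides of the match. The Python sketch below reproduces that overlap outside Elasticsearch. It is only an approximation: the tokenizer is a plain whitespace split, the stop filter and scoring are left out, and RULES, analyze and matches are hypothetical names introduced for this illustration.

```python
# Hypothetical stand-in for the vocab_taxonomy rules from the index settings.
RULES = {
    "computer": ["computer", "laptop", "mini"],
    "ordinateur": ["computer", "laptop", "mini"],
    "laptop": ["laptop", "mini"],
    "pc_protable": ["laptop", "mini"],
    "mini": ["mini"],
    "mini_laptop": ["mini"],
}


def analyze(text):
    """Lowercase, split on whitespace, strip '.' and ',', then apply RULES."""
    tokens = set()
    for raw in text.lower().split():
        word = raw.strip(".,")
        tokens.update(RULES.get(word, [word]))
    return tokens


def matches(query, description):
    """True if the analyzed query shares at least one token with the document."""
    return bool(analyze(query) & analyze(description))


doc_computer = "Modern computer is very different from early computer."

# The document never mentions "laptop", yet its token set contains it.
print(sorted(analyze(doc_computer) & analyze("laptop")))  # ['laptop', 'mini']
print(matches("laptop", doc_computer))                    # True
print(matches("mini", doc_computer))                      # True
```

In this simplified model, a description containing computer is stored with laptop and mini as well, so any query lower in the tree still overlaps with it.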
The same thing happens when searching for mini. I'm aware that this behavior is caused by the vocab_taxonomy filter, but I don't know how to resolve it. I hope the answers to the following questions will help me do so:
- I don't clearly understand how vocab_taxonomy works. What I know is that in the line "mini, mini_laptop => mini", mini and mini_laptop will both be transformed to mini, but what happens in the other two cases ("computer, ordinateur => computer, laptop, mini" and "laptop, pc_protable => laptop, mini")?
- Going back to the search results, how can I make sure that ES goes ONLY down the tree and returns results related to the taxonomies/categories that are below the queried word?
- Is there a way to control how ES climbs up or goes down the taxonomy tree?
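To make the "only down" requirement from the second question concrete, here is a Python sketch of the behavior being asked for. It is not a tested Elasticsearch configuration: it assumes that indexed text is only normalized to canonical terms while the query is expanded to the term plus everything below it. CHILDREN, CANONICAL and all function names are hypothetical. In Elasticsearch terms this would mean different analysis at index time and at search time (text fields accept a search_analyzer parameter), but whether that produces exactly this behavior with the mapping above is an open assumption.

```python
# Hypothetical data describing the taxonomy from the question.
CHILDREN = {"computer": ["laptop"], "laptop": ["mini"], "mini": []}
CANONICAL = {
    "ordinateur": "computer",
    "pc_protable": "laptop",
    "mini_laptop": "mini",
}


def normalize(token):
    """Map a synonym to its canonical term; other tokens are unchanged."""
    return CANONICAL.get(token, token)


def index_tokens(text):
    """Index side: normalize synonyms only, with no taxonomy expansion."""
    return {normalize(w.strip(".,")) for w in text.lower().split()}


def descendants(term):
    """Return the term followed by everything below it in the tree."""
    result = [term]
    for child in CHILDREN.get(term, []):
        result.extend(descendants(child))
    return result


def query_tokens(query):
    """Query side: expand each term downward through the taxonomy."""
    tokens = set()
    for w in query.lower().split():
        tokens.update(descendants(normalize(w)))
    return tokens


def matches(query, description):
    return bool(query_tokens(query) & index_tokens(description))


print(matches("laptop", "Ordinateur is in french."))        # False
print(matches("laptop", "Dell Latitude is a mini too."))    # True
print(matches("computer", "Dell Latitude is a mini too."))  # True
```

With this asymmetry, a query for laptop reaches laptop and mini descriptions but no longer reaches computer or Ordinateur ones, which is the ordering constraint described in the question.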
Thank you for your time!