Using taxonomy and synonyms in Elasticsearch 6


(Bilal) #1

I'm trying to implement what's called a managed vocabulary (an extension of a taxonomy that also accounts for synonyms), based on the ideas presented in the article "Patterns for Elasticsearch Synonyms: Taxonomies and Managed Vocabularies", and I ran into some issues with the classification of the terms and the results of the query.

Here is an example that explains the problem:

Assuming I have the following taxonomy:

Computer (has synonyms : Ordinateur...)
└── Laptop (has synonyms : PC_Protable...)
    └── Mini (has synonyms : Mini_Laptop ...)

What I want is that if the user searches for computer, the search results first contain all descriptions with the word computer or its synonym Ordinateur, afterwards the descriptions that contain Laptop, and so on, until the end of the tree is reached (in this case Mini). Here's what I've done:

I indexed the data with and without synonyms as @abdon suggested in this answer:

PUT taxon_test
{
  "mappings": {
    "tech": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "taxonomy_text"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "vocab_taxonomy": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [ 
            "computer, ordinateur => computer, laptop, mini",
            "laptop, pc_protable => laptop, mini",
            "mini, mini_laptop => mini"
            ]
        }
      },
      "analyzer": {
        "taxonomy_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop", "vocab_taxonomy"]
        }
      }
    }
  }
}

And filled the index:

PUT taxon_test/tech/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Modern computer has the ability to follow generalized sets of operations." }
{ "index" : { "_id" : "2" } }
{ "description": "Modern computer is very different from early computer."}
{ "index" : { "_id" : "3" } }
{ "description": "Dell's XPS 13 remains one of the best all-around 13-inch laptops." }
{ "index" : { "_id" : "4" } }
{ "description": "Find a great collection of laptop at HP." }
{ "index" : { "_id" : "5" } }
{ "description": "Mini Samsung Chromebook 3 has a good design." }
{ "index" : { "_id" : "6" } }
{ "description": "Dell Latitude is a mini too." }
{ "index" : { "_id" : "7" } }
{ "description": "Ordinateur is in french." }
{ "index" : { "_id" : "8" } }
{ "description": "Find the laptop to suit your needs when you shop." }

When looking for computer using the following query:

GET taxon_test/tech/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "computer"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": "computer"
          }
        }
      ]
    }
  }
}

I get all the results sorted (all descriptions that contain computer -> then laptop -> then mini), exactly as I wanted:

"Modern computer is very different from early computer." (score: 2.1292622)
"Modern computer has the ability to follow generalized sets of operations." (score: 1.4279017)
"Ordinateur is in french." (score: 0.34066662)
"Find a great collection of laptop at HP." (score: 0.27848446)
"Find the laptop to suit your needs when you shop." (score: 0.24843818)
"Dell Latitude is a mini too." (score: 0.22731867)
"Mini Samsung Chromebook 3 has a good design." (score: 0.18983713)

But when I use the same query to look for laptop, I also get the computer documents. Here is the result (what I wanted was only the laptop documents first and then the mini ones):

"Find a great collection of laptop at HP." (score: 1.591003)
"Find the laptop to suit your needs when you shop." (score: 1.4431245)
"Ordinateur is in french." (score: 0.31679824)
"Modern computer is very different from early computer." (score: 0.31380016)
"Modern computer has the ability to follow generalized sets of operations." (score: 0.24843818)
"Dell Latitude is a mini too." (score: 0.22731867)
"Mini Samsung Chromebook 3 has a good design." (score: 0.18983713)

The same thing happens when searching for mini. I'm aware that this behavior is caused by the vocab_taxonomy filter and I don't know how to resolve this issue, but I hope that the answers to my questions will help me do so:

  1. I don't clearly understand how vocab_taxonomy works. What I know is that with the line "mini, mini_laptop => mini", mini and mini_laptop will be transformed to mini, but what happens in the other two cases ("computer, ordinateur => computer, laptop, mini" and "laptop, pc_protable => laptop, mini")?

  2. Going back to the search results, how can I make sure that ES goes ONLY down the tree and returns results related to the taxonomies/categories that are below the queried word?

  3. Is there a way to control how ES climbs up/goes down the taxonomy tree?

Thank you for your time!


(Christoph) #2

Hi @Bilal

This is an interesting problem. I just read the blog post myself, and it has some very interesting suggestions. I will try to dig deeper into this topic, but for now let me just answer your first question:

This type of rule (in the Solr synonym file syntax) means that each token matching the left-hand side is replaced by all the tokens on the right-hand side. In the first case (mini, mini_laptop => mini), when the filter sees "mini" or "mini_laptop" it replaces it with just "mini" (so tokens that are already "mini" are effectively unaffected).

In the second case (computer, ordinateur => computer, laptop, mini), when the filter sees "computer" or "ordinateur", it replaces the token with "computer", "laptop" and "mini", all inserted as alternative (synonym) terms at the same text position. You can check what happens under the hood, e.g. by using the term vectors API. For the first document you get:

GET /taxon_test/tech/1/_termvectors
{
  "fields": ["description.synonyms"]
}

gives

{
  [...]
  "term_vectors": {
    "description.synonyms": {
      "field_statistics": { [... ] },
      "terms": {
        [...]
        "computer": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 7,
              "end_offset": 15
            }
          ]
        }, [...] ,
        "laptop": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 7,
              "end_offset": 15
            }
          ]
        },
        "mini": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 7,
              "end_offset": 15
            }
          ]
        }
      }
    }
  }
}

As you can see, the index contains three entries ("computer", "laptop" and "mini") at the same position (position 1). You will see similar behaviour for "ordinateur" in document 7, except that "ordinateur" itself is no longer indexed, because it doesn't appear on the right-hand side of the synonym rule. The same thing happens for the remaining rule (laptop, pc_protable => laptop, mini) from your example.
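
To make the rule semantics a bit more tangible, here is a rough Python model of how explicit "=>" rules rewrite a token stream. It is only a sketch of the behaviour described above, not how Elasticsearch or Lucene implement the filter. The helper names parse_rules and expand are made up for illustration, only single-token terms are handled, and stop-word removal is ignored.

```python
# Toy model of explicit "a, b => x, y" synonym rules (single-token terms only).
# parse_rules and expand are hypothetical helpers, not Elasticsearch code.

def parse_rules(rules):
    """Map every left-hand term to the list of right-hand replacement terms."""
    mapping = {}
    for rule in rules:
        left, right = rule.split("=>")
        replacements = [term.strip() for term in right.split(",")]
        for term in left.split(","):
            mapping[term.strip()] = replacements
    return mapping


def expand(text, mapping):
    """Return (position, term) pairs; replacements share the original position."""
    result = []
    for position, token in enumerate(text.lower().split()):
        token = token.strip(".,")
        # A token without a rule is kept as-is; a token with a rule is replaced.
        for term in mapping.get(token, [token]):
            result.append((position, term))
    return result


rules = [
    "computer, ordinateur => computer, laptop, mini",
    "laptop, pc_protable => laptop, mini",
    "mini, mini_laptop => mini",
]
mapping = parse_rules(rules)

doc1 = expand("Modern computer has the ability", mapping)
doc7 = expand("Ordinateur is in french.", mapping)
print(doc1)
print(doc7)
```

In this model "computer" in document 1 yields three terms at position 1, and document 7 no longer contains "ordinateur" at all, which mirrors the term vectors output.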

Looking at your other questions, I suspect that you are missing two things. First, I think the example in the blog post contains an additional, earlier synonym filter (the "autophrase_syn" filter) that turns two-token phrases like "mini laptop" into a single token ("mini_laptop").
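
As a rough illustration of what such an auto-phrasing step does, here is a toy Python model. It is not the actual synonym filter; the function name autophrase and the phrase table are assumptions made up for this sketch.

```python
# Toy model of an "auto-phrasing" step: merge known two-token phrases into one token.
# autophrase and PHRASES are hypothetical; in Elasticsearch a synonym filter does this.

PHRASES = {
    ("mini", "laptop"): "mini_laptop",
    ("pc", "portable"): "pc_portable",
}


def autophrase(tokens, phrases=PHRASES):
    """Scan left to right and replace each known token pair with its single-token form."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append(phrases[pair])
            i += 2  # both tokens of the phrase are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "a mini laptop is a small pc portable".split()
print(autophrase(tokens))
```

Once phrases are single tokens, the taxonomy rules can treat "mini_laptop" like any other term.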

Second, I think your example gets the direction of the synonym rules derived from the taxonomy wrong. The blog post suggests turning a taxonomy tree like:

 animal  -> mammal -> elephant -> african_elephant

into a rule like:

african_elephant => african_elephant, elephant, mammal, animal

Think about it this way: an "african elephant" is an "elephant" is a "mammal" is an "animal". So if you substitute one for the other going up the tree, you don't state anything wrong. If, however, you reverse the direction with a rule like "mammal => mammal, elephant", you are basically saying that each occurrence of "mammal" should also be treated as if "elephant" occurred, which can lead to trouble.

This is why I think your rules like "computer, ordinateur => computer, laptop, mini" should probably be the other way round, since not all computers are laptops and not all laptops are minis (if I understand the domain right).
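
A minimal sketch of how such "upward" rules could be generated from your taxonomy. The data layout (a child -> parent dictionary plus a synonym table), the helper names, and "pc_portable" as the intended spelling are all assumptions for this example.

```python
# Sketch: derive "term => term, ancestors..." rules from a child -> parent taxonomy.
# PARENT, SYNONYMS, ancestors and build_rules are hypothetical names for illustration.

PARENT = {"laptop": "computer", "mini": "laptop"}  # child -> parent
SYNONYMS = {                                        # canonical term -> synonyms
    "computer": ["ordinateur"],
    "laptop": ["pc_portable"],
    "mini": ["mini_laptop"],
}


def ancestors(term):
    """Walk up the tree and return [term, parent, grandparent, ...]."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_rules():
    rules = []
    for term, synonyms in SYNONYMS.items():
        chain = ancestors(term)
        if len(chain) > 1:
            # A root term has nothing above it, so it needs no rule for itself.
            rules.append(f"{term} => {', '.join(chain)}")
        for synonym in synonyms:
            rules.append(f"{synonym} => {', '.join([synonym] + chain)}")
    return rules


for rule in build_rules():
    print(rule)
```

Every rule only ever adds terms from higher up the tree, so a "mini" is also indexed as a "laptop" and a "computer", but never the other way round.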

I will play a bit further with your examples to come up with a better solution. This is certainly an interesting topic and it takes a bit of tinkering to get right, but I think it's really worth it in the end.


(Christoph) #3

Hi @Bilal,

I found at least one more missing piece in your taxonomy example (there are probably more). In addition to adding an "autophrase_syn" filter like in the blog post and reversing the synonym expansion rules, it also makes sense in your case not to apply the taxonomy synonym expansion at search time. Here's why:

If you have rules like:

"synonyms": [
  "ordinateur => ordinateur, computer",
  "laptop => laptop, computer",
  "pc_portable => pc_portable, laptop, computer",
  "mini => mini, laptop, computer",
  "mini_laptop => mini_laptop, mini, laptop, computer"
]

at index time, you are basically saying things like "a document containing 'ordinateur' should also be considered a document containing 'computer'", or that documents containing "laptop" are documents about computers too. When you don't specify a separate "search_analyzer" for the field, the same analysis process is applied to the query. So a query for "mini" gets treated as if the user were also looking for "laptop" and "computer". In some use cases this might actually be intended, but in your case you state that "laptop" searches should not include the superset of all "computer" documents, so the synonym expansion shouldn't be applied at search time.
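
The effect can be sketched with a small Python model that expands terms at index time only. The rule table, the whitespace tokenization and the helper names are simplifications I made up for illustration; scoring and the auto-phrasing step are left out.

```python
# Toy model: expand synonyms at index time only, then match plain query terms.
# RULES, tokenize, index_terms and search are hypothetical, simplified helpers.

RULES = {  # term -> terms indexed in its place (upward expansion)
    "ordinateur": ["ordinateur", "computer"],
    "laptop": ["laptop", "computer"],
    "mini": ["mini", "laptop", "computer"],
}

DOCS = {
    2: "Modern computer is very different from early computer.",
    4: "Find a great collection of laptop at HP.",
    6: "Dell Latitude is a mini too.",
    7: "Ordinateur is in french.",
}


def tokenize(text):
    return [token.strip(".,") for token in text.lower().split()]


def index_terms(text):
    """Index-time analysis: every token is replaced by its expansion, if any."""
    terms = set()
    for token in tokenize(text):
        terms.update(RULES.get(token, [token]))
    return terms


def search(query, expand_query=False):
    """Return ids of documents whose indexed terms contain any query term."""
    query_terms = index_terms(query) if expand_query else set(tokenize(query))
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if query_terms & index_terms(text))


print(search("laptop"))                     # plain query term
print(search("laptop", expand_query=True))  # query expanded like the documents
```

With the plain query term, "laptop" matches only documents 4 and 6 (laptop and mini). When the query is expanded like the documents, it also matches documents 2 and 7, which is the unwanted behaviour.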

Take a look at this small example that I derived from your first post. Do the results you get now look better and closer to what you are trying to achieve?

DELETE taxon_test

PUT taxon_test
{
  "mappings": {
    "tech": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "taxonomy_text",
              "search_analyzer": "taxonomy_text_search"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": [ 
            "pc portable => pc_portable",
            "mini laptop => mini_laptop"
            ]
        },
        "vocab_taxonomy": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [ 
            "ordinateur => ordinateur, computer",
            "laptop => laptop, computer",
            "pc_portable => pc_portable, laptop, computer",
            "mini => mini, laptop, computer",
            "mini_laptop => mini_laptop, mini, laptop, computer"
            ]
        }
      },
      "analyzer": {
        "taxonomy_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop", "autophrase_syn", "vocab_taxonomy"]
        },
        "taxonomy_text_search": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop", "autophrase_syn"]
        }
      }
    }
  }
}

PUT taxon_test/tech/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Modern computer has the ability to follow generalized sets of operations." }
{ "index" : { "_id" : "2" } }
{ "description": "Modern computer is very different from early computer."}
{ "index" : { "_id" : "3" } }
{ "description": "Dell's XPS 13 remains one of the best all-around 13-inch laptops." }
{ "index" : { "_id" : "4" } }
{ "description": "Find a great collection of laptop at HP." }
{ "index" : { "_id" : "5" } }
{ "description": "Mini Samsung Chromebook 3 has a good design." }
{ "index" : { "_id" : "6" } }
{ "description": "Dell Latitude is a mini too." }
{ "index" : { "_id" : "7" } }
{ "description": "Ordinateur is in french." }
{ "index" : { "_id" : "8" } }
{ "description": "Find the laptop to suit your needs when you shop." }

GET taxon_test/tech/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "laptop"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": "laptop"
          }
        }
      ]
    }
  }
}

GET /taxon_test/tech/7/_termvectors
{
  "fields": ["description.synonyms"]
}
