Hi elastic!
We are developing an e-commerce application and have been using Elasticsearch 2.0 since 2014. We are prevented from upgrading by a problem that we simply cannot solve with the current ES version.
The German language has the concept of compound words. To fit our needs in ES 2.0, we use the plugin elasticsearch-analysis-decompound, which works fine.
Typical words in our domain are:
[Kabelkanal, Antennenkabel, Antennenhalterung, Tablethalterung, Tablethalter] =>
[cable duct, aerial wire, aerial holder, tablet holder].
Tablethalterung is just a synonym for Tablethalter.
Common search terms are kabel kanal, Kabelkanal, tablethalter or tablet halterung etc.
With ES 2, we get only documents that contain, for example, both Kabel and Kanal in one of the searched fields. Documents that contain Kabel in combination with another word (e.g. Antenne) are not found. It also doesn't matter whether you search for Kabelkanal or Kabel Kanal.
A simple example with Kibana Dev Tools:
Create the index, analyzers and mapping:
PUT compound_word_example/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_de",
            "domain-specific-decompounder",
            "snow_de",
            "hyphenation_decompounder_de",
            "snow_de",
            "unique"
          ]
        }
      },
      "filter": {
        "stop_de": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [
            "_german_"
          ]
        },
        "snow_de": {
          "type": "snowball",
          "language": "German2"
        },
        "domain-specific-decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "tablet"
          ]
        },
        "hyphenation_decompounder_de": {
          "type": "hyphenation_decompounder",
          "word_list": [
            "kabel",
            "kanal",
            "antenne",
            "halter",
            "halt",
            "halterung"
          ],
          "hyphenation_patterns_path": "hyph/de_DR.xml",
          "min_subword_size": 4,
          "only_longest_match": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom"
      }
    }
  }
}
Index some products
POST /compound_word_example/_bulk
{"index": {"_id": 1}}
{"title":"Antennen-Kabel"}
{"index": {"_id": 2}}
{"title":"Kabelkanal Blau"}
{"index": {"_id": 3}}
{"title":"Kabelkanal Grün"}
{"index": {"_id": 4}}
{"title":"Antennenkabel"}
{"index": {"_id": 5}}
{"title":"Antennen-Kabel alu"}
{"index": {"_id": 6}}
{"title":"Tablethalterung"}
{"index": {"_id": 7}}
{"title":"Tablet halterung"}
{"index": {"_id": 8}}
{"title":"Wandhalterung"}
{"index": {"_id": 9}}
{"title":"Antennenhalterung"}
Search for Kabelkanal:
POST /compound_word_example/_search
{
  "query": {
    "match": {
      "title": {"query": "kabelkanal", "operator": "and"}
    }
  }
}
Results:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 2.2698462,
    "hits" : [
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Blau"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Grün"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.8161402,
        "_source" : {
          "title" : "Antennenkabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68392545,
        "_source" : {
          "title" : "Antennen-Kabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.58857614,
        "_source" : {
          "title" : "Antennen-Kabel alu"
        }
      }
    ]
  }
}
The documents containing Kabelkanal are expected, since they contain both kabel and kanal, but the documents with Antennenkabel or Antennen-Kabel are not. If we search for Kabel Kanal instead, the results contain only documents with the term Kabelkanal, which is the expected result. As I understand it, this is because of the and operator, as described in the documentation:
operator (Optional, string) Boolean logic used to interpret text in the query value. Valid values are:
OR (Default) For example, a query value of capital of Hungary is interpreted as capital OR of OR Hungary.
AND For example, a query value of capital of Hungary is interpreted as capital AND of AND Hungary.
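One way to check how the operator is actually applied is the validate API with explain=true, which prints the rewritten Lucene query for both spellings. This is only a sketch of a diagnostic step, not output we have captured. My assumption is that tokens emitted at the same position (as the decompounder does) may be grouped like synonyms, in which case the and operator would only apply between positions. The explanation text differs between Elasticsearch versions, so none is shown here.

```console
# Compound typed as one word: a single input token, decompounded by the analyzer
GET /compound_word_example/_validate/query?explain=true
{
  "query": {
    "match": {
      "title": {"query": "kabelkanal", "operator": "and"}
    }
  }
}

# Same terms typed as two words: two input tokens
GET /compound_word_example/_validate/query?explain=true
{
  "query": {
    "match": {
      "title": {"query": "kabel kanal", "operator": "and"}
    }
  }
}
```

If the first explanation shows kabelkanal, kabel and kanal inside a single clause while the second shows two separate required clauses, that would account for the difference in results.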
The analyzer works as expected
POST compound_word_example/_analyze
{
  "field": "title",
  "text": ["kabelkanal"]
}
produces tokens
{
  "tokens" : [
    {
      "token" : "kabelkanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kabel",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
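For comparison, the same _analyze call can be run on the two-word form. This is a minimal sketch. The point is to compare the position values with the output above, where all three tokens share position 0. I would expect kabel and kanal to land on different positions here, but that is an assumption and the output is not reproduced.

```console
# Analyze the two-word spelling with the same field analyzer
POST compound_word_example/_analyze
{
  "field": "title",
  "text": ["kabel kanal"]
}
```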
IMHO, when using a decompounder, the result should be the same regardless of whether you search for kabelkanal or kabel kanal. Only a search for kabel on its own should return all documents containing kabel, e.g. Antennenkabel, Kabelkanal, Kabelhalter etc.
Since we have a lot of such words, the synonym filter doesn't seem to be the right approach.
Is there a way to combine the tokens produced by the decompound analyzer with an and? What are we doing wrong? Any suggestion is welcome.