Hi elastic!
We are developing an e-commerce application and have been using Elasticsearch 2.0 since 2014. We are prevented from upgrading by a problem that we simply cannot solve with the current ES version.
The German language has the concept of compound words. To fit our needs in ES 2.0 we use the plugin elasticsearch-analysis-decompound, which works fine.
Typical words in our domain are:
[Kabelkanal, Antennenkabel, Antennenhalterung, Tablethalterung, Tablethalter] =>
[cable duct, aerial wire, aerial holder, tablet holder].
Tablethalterung is just a synonym for Tablethalter.
Common search terms are kabel kanal, Kabelkanal, tablethalter or tablet halterung etc.
With ES 2, we get only documents that contain, for example, both Kabel and Kanal in one of the searched fields. Documents that contain Kabel in combination with another word (e.g. Antenne) are not found. It also doesn't matter whether you search for Kabelkanal or Kabel Kanal.
Simple example with Kibana Dev Tools:
Create the index, analyzers and mapping:
PUT compound_word_example/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_de",
            "domain-specific-decompounder",
            "snow_de",
            "hyphenation_decompounder_de",
            "snow_de",
            "unique"
          ]
        }
      },
      "filter": {
        "stop_de": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [
            "_german_"
          ]
        },
        "snow_de": {
          "type": "snowball",
          "language": "German2"
        },
        "domain-specific-decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "tablet"
          ]
        },
        "hyphenation_decompounder_de": {
          "type": "hyphenation_decompounder",
          "word_list": [
            "kabel",
            "kanal",
            "antenne",
            "halter",
            "halt",
            "halterung"
          ],
          "hyphenation_patterns_path": "hyph/de_DR.xml",
          "min_subword_size": 4,
          "only_longest_match": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom"
      }
    }
  }
}
Index some products
POST /compound_word_example/_bulk
{"index": {"_id": 1}}
{"title":"Antennen-Kabel"}
{"index": {"_id": 2}}
{"title":"Kabelkanal Blau"}
{"index": {"_id": 3}}
{"title":"Kabelkanal Grün"}
{"index": {"_id": 4}}
{"title":"Antennenkabel"}
{"index": {"_id": 5}}
{"title":"Antennen-Kabel alu"}
{"index": {"_id": 6}}
{"title":"Tablethalterung"}
{"index": {"_id": 7}}
{"title":"Tablet halterung"}
{"index": {"_id": 8}}
{"title":"Wandhalterung"}
{"index": {"_id": 9}}
{"title":"Antennenhalterung"}
Search for Kabelkanal:
POST /compound_word_example/_search
{
  "query": {
    "match": {
      "title": {"query": "kabelkanal", "operator": "and"}
    }
  }
}
Results:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 2.2698462,
    "hits" : [
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Blau"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Grün"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.8161402,
        "_source" : {
          "title" : "Antennenkabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68392545,
        "_source" : {
          "title" : "Antennen-Kabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.58857614,
        "_source" : {
          "title" : "Antennen-Kabel alu"
        }
      }
    ]
  }
}
All documents containing Kabelkanal are expected, because they contain both kabel and kanal, but documents with Antennenkabel are not. If we search for Kabel Kanal, the results contain only documents with the term Kabelkanal, which is the expected result. As I understand it, this is because of the and operator, as described in the documentation:
operator
(Optional, string) Boolean logic used to interpret text in the query value. Valid values are:
OR (Default)
For example, a query value of capital of Hungary is interpreted as capital OR of OR Hungary.
AND
For example, a query value of capital of Hungary is interpreted as capital AND of AND Hungary.
The analyzer works as expected:
POST compound_word_example/_analyze
{
  "field": "title",
  "text": ["kabelkanal"]
}
produces these tokens:
{
  "tokens" : [
    {
      "token" : "kabelkanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kabel",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
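Note that all three tokens share position 0. One possible explanation for the search results, stated here as an assumption rather than confirmed behaviour, is that the match query applies the operator only between different positions and treats tokens at the same position as alternatives. The small Python sketch below mimics that assumed grouping. The functions position_groups and matches are made up for illustration, the tokens for "kabel kanal" are guessed, and the document term sets are hand-written, not real analyzer output.

```python
# Sketch only: mimics how a match query might group analyzed tokens.
# Assumption: tokens sharing a position are alternatives (OR), and the
# "operator" is applied only between different positions.

def position_groups(tokens):
    """Group token texts by their position, in position order."""
    groups = {}
    for tok in tokens:
        groups.setdefault(tok["position"], []).append(tok["token"])
    return [groups[pos] for pos in sorted(groups)]


def matches(doc_terms, tokens, operator="and"):
    """Return True if a document's term set satisfies the grouped query."""
    hits = [any(term in doc_terms for term in group)
            for group in position_groups(tokens)]
    return all(hits) if operator == "and" else any(hits)


# Tokens as returned by _analyze for "kabelkanal": all at position 0.
compound_query = [
    {"token": "kabelkanal", "position": 0},
    {"token": "kabel", "position": 0},
    {"token": "kanal", "position": 0},
]
# Guessed tokens for "kabel kanal": one token per position.
split_query = [
    {"token": "kabel", "position": 0},
    {"token": "kanal", "position": 1},
]

# Hand-written term sets for illustration, not real analyzer output.
kabelkanal_doc = {"kabelkanal", "kabel", "kanal", "blau"}
antennenkabel_doc = {"antennenkabel", "antenne", "kabel"}

print(matches(antennenkabel_doc, compound_query))  # True
print(matches(antennenkabel_doc, split_query))     # False
print(matches(kabelkanal_doc, split_query))        # True
```

Under this assumption, the single-word query collapses into one group of alternatives, so the and operator has nothing to combine, which would match the results shown above.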
IMHO, when using a decompounder, the result should be the same regardless of whether you search for kabelkanal or kabel kanal. Only a search for kabel alone should return all documents containing kabel, e.g. Antennenkabel, Kabelkanal, Kabelhalter etc.
Since we have a lot of such words, the synonym filter doesn't seem to be the right way.
Is there a way to combine the tokens produced by the decompound analyzer with an and? What are we doing wrong? Any suggestion is welcome.
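To make the desired behaviour concrete, here is a minimal client-side sketch in Python, not a tested solution. It takes an _analyze response and builds a bool query in which every sub-word becomes a must clause. The helper build_and_query is hypothetical. It assumes that, when several tokens share a position, the first one is the original compound and the rest are its parts, as in the output above. If a compound is only partly decomposed, this heuristic would drop the remainder.

```python
# Sketch only: build a bool/must query from an _analyze response.
# build_and_query is a hypothetical helper, not part of any client library.

def build_and_query(field, analyze_response):
    """Turn analyzed tokens into term clauses that must all match."""
    by_position = {}
    for tok in analyze_response["tokens"]:
        by_position.setdefault(tok["position"], []).append(tok["token"])

    must = []
    for pos in sorted(by_position):
        group = by_position[pos]
        # Assumption: with several tokens at one position, the first is the
        # original compound and the rest are its parts; keep only the parts.
        parts = group[1:] if len(group) > 1 else group
        must.extend({"term": {field: part}} for part in parts)
    return {"query": {"bool": {"must": must}}}


# The _analyze response for "kabelkanal", reduced to the fields used here.
analyzed = {
    "tokens": [
        {"token": "kabelkanal", "position": 0},
        {"token": "kabel", "position": 0},
        {"token": "kanal", "position": 0},
    ]
}

query_body = build_and_query("title", analyzed)
print(query_body)
```

The resulting body could be sent to the _search endpoint. It uses term clauses because the tokens are already analyzed, at the cost of one extra _analyze round trip per search.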