Hi,
I used the hyphenation_decompounder token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hyp-decomp-tokenfilter.html) for German and followed the example in the documentation. So far so good, it works! The text kaffeetasse is tokenised into kaffee and tasse.
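For anyone who wants to reproduce the tokenisation, the _analyze API can be pointed at the search analyzer. This is only a minimal sketch: it assumes a local node on localhost:9200 and that the testidx index from the test case below has already been created. The variable name ANALYZE_BODY is just for illustration.

```shell
# Request body: run the "search" analyzer from the index settings on the compound word.
ANALYZE_BODY='{ "analyzer": "search", "text": "kaffeetasse" }'

# Assumption: Elasticsearch is reachable on localhost:9200 and testidx exists.
curl -s -XGET "http://localhost:9200/testidx/_analyze" \
  -H 'Content-Type: application/json' -d "$ANALYZE_BODY" \
  || echo "request failed (is Elasticsearch running on localhost:9200?)"
```

Each token in the response carries a position field, which may be worth comparing across the emitted tokens when looking at how the query is later built.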
The concern arose when I built a multi_match query for kaffeetasse to find documents where both kaffee AND tasse match. It seems that multi_match combines these tokens with OR instead of the operator given in the query. Here is my test case:
curl -XPUT "http://localhost:9200/testidx" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index": { "type": "custom", "tokenizer": "whitespace", "filter": ["lowercase"] },
          "search": { "type": "custom", "tokenizer": "whitespace", "filter": ["lowercase", "hyph"] }
        },
        "filter": {
          "hyph": {
            "type": "hyphenation_decompounder",
            "hyphenation_patterns_path": "analysis/de_DR.xml",
            "word_list": ["kaffee", "zucker", "tasse"],
            "only_longest_match": true,
            "min_subword_size": 4
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "index", "search_analyzer": "search" },
      "description": { "type": "text", "analyzer": "index", "search_analyzer": "search" }
    }
  }
}'
curl -XPOST "http://localhost:9200/testidx/_doc/1" -H 'Content-Type: application/json' -d'{ "title" : "Kaffee", "description": "Milch Kaffee tasse"}'
curl -XPOST "http://localhost:9200/testidx/_doc/2" -H 'Content-Type: application/json' -d'{ "title" : "Kaffee", "description": "Latte Kaffee Becher"}'
curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "kaffeetasse",
      "fields": ["title", "description"],
      "operator": "and",
      "type": "cross_fields",
      "analyzer": "search"
    }
  }
}'
I expected only document id=1, since it is the only one with both "kaffee" and "tasse" in its fields, but the query returns both documents, as if it were matching "kaffee" OR "tasse".
At first glance, this looks like a bug to me. Any thoughts?
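In case it helps with diagnosing this, the validate API with explain=true should show the Lucene query that the multi_match is turned into. Again, this is a sketch rather than a verified recipe: it assumes the same local node and the testidx index, and VALIDATE_BODY is only an illustrative variable name. I have not pasted any output here.

```shell
# Same multi_match query as in the _search request, wrapped for the validate API.
VALIDATE_BODY='{ "query": { "multi_match": { "query": "kaffeetasse", "fields": ["title", "description"], "operator": "and", "type": "cross_fields", "analyzer": "search" } } }'

# Assumption: Elasticsearch is reachable on localhost:9200 and testidx exists.
# explain=true asks for a textual description of the query that would be executed.
curl -s -XGET "http://localhost:9200/testidx/_validate/query?explain=true" \
  -H 'Content-Type: application/json' -d "$VALIDATE_BODY" \
  || echo "request failed (is Elasticsearch running on localhost:9200?)"
```

The explanations entry in the response should make it visible whether kaffee and tasse end up as required clauses or as alternatives.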
Elasticsearch: 7.9.2
de_DR.xml was downloaded from https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download as mentioned in the documentation.
For non-German speakers: kaffeetasse => coffee cup