German compound words in an e-commerce search

Hi elastic!

We are developing an e-commerce application and have been using Elasticsearch 2.0 since 2014. We are prevented from upgrading by a problem that we simply cannot solve with the current ES version.

The German language has the concept of compound words. To handle them in ES 2.0 we use the plugin elasticsearch-analysis-decompound, which works fine.

Typical words in our domain are:
[Kabelkanal, Antennenkabel, Antennenhalterung, Tablethalterung, Tablethalter] =>
[cable duct, aerial wire, aerial holder, tablet holder]
Tablethalterung is just a synonym for Tablethalter.

Common search terms are kabel kanal, Kabelkanal, tablethalter or tablet halterung etc.

With ES 2, we get only documents that contain, for example, Kabel and Kanal in one of the searched fields. Documents that contain Kabel in combination with another word (e.g. Antenne) are not found. It also doesn't matter whether you search for Kabelkanal or Kabel Kanal.

Simple example with Kibana Dev Tools:
Create the index, analyzers and mapping

PUT compound_word_example/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_de",
            "domain-specific-decompounder",
            "snow_de",
            "hyphenation_decompounder_de",
            "snow_de",
            "unique"
          ]
        }
      },
      "filter": {
        "stop_de": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [
            "_german_"
          ]
        },
        "snow_de": {
          "type": "snowball",
          "language": "German2"
        },
        "domain-specific-decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "tablet"
          ]
        },
        "hyphenation_decompounder_de": {
          "type": "hyphenation_decompounder",
          "word_list": [
            "kabel",
            "kanal",
            "antenne",
            "halter",
            "halt",
            "halterung"
          ],
          "hyphenation_patterns_path": "hyph/de_DR.xml",
          "min_subword_size": 4,
          "only_longest_match": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom"
      }
    }
  }
}

Index some products

POST /compound_word_example/_bulk
{"index": {"_id": 1}}
{"title":"Antennen-Kabel"}
{"index": {"_id": 2}}
{"title":"Kabelkanal Blau"}
{"index": {"_id": 3}}
{"title":"Kabelkanal Grün"}
{"index": {"_id": 4}}
{"title":"Antennenkabel"}
{"index": {"_id": 5}}
{"title":"Antennen-Kabel alu"}
{"index": {"_id": 6}}
{"title":"Tablethalterung"}
{"index": {"_id": 7}}
{"title":"Tablet halterung"}
{"index": {"_id": 8}}
{"title":"Wandhalterung"}
{"index": {"_id": 9}}
{"title":"Antennenhalterung"}

Search for Kabelkanal:

POST /compound_word_example/_search
{
  "query": {
    "match": {
      "title": {"query": "kabelkanal", "operator": "and"}
    }
  }
}

Results:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 2.2698462,
    "hits" : [
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Blau"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.2698462,
        "_source" : {
          "title" : "Kabelkanal Grün"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.8161402,
        "_source" : {
          "title" : "Antennenkabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68392545,
        "_source" : {
          "title" : "Antennen-Kabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.58857614,
        "_source" : {
          "title" : "Antennen-Kabel alu"
        }
      }
    ]
  }
}

All documents containing Kabelkanal are expected because of kabel and kanal, but documents with Antennenkabel are not. If we search for Kabel Kanal, the results contain only documents with the term Kabelkanal, which is the expected result. As I understand it, this is because of the and operator, as described in the documentation:

operator

(Optional, string) Boolean logic used to interpret text in the query value. Valid values are:

OR (Default)

For example, a query value of capital of Hungary is interpreted as capital OR of OR Hungary.

AND

For example, a query value of capital of Hungary is interpreted as capital AND of AND Hungary

The analyzer works as expected

POST compound_word_example/_analyze
{
  "field": "title",
  "text": ["kabelkanal"]
}

produces tokens

{
  "tokens" : [
    {
      "token" : "kabelkanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kabel",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kanal",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

IMHO when using a decompounder the result should be the same regardless of whether you are searching for kabelkanal or kabel kanal. Only searching for kabel should return all documents containing kabel e.g. Antennenkabel, Kabelkanal, Kabelhalter etc.

Since we have a lot of such words, the synonym filter doesn't seem the right way.

Is there a way to combine the tokens produced by the decompound analyzer with an and? What are we doing wrong? Any suggestion is welcome.

Can you clarify whether what you are demonstrating here worked differently in the old 2.0 version? I just tried running the mini-example above (with minor syntax modifications for the old version) on 2.0, and when I use the "_analyze" API to check how "kabelkanal" gets analyzed in 2.0, I get

{
  "tokens": [
    {
      "token": "kabelkanal",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "kabel",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "kanal",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

which at first glance seems the same as in a recent 7.12 version.
The query is internally rewritten to

{
      "index": "compound_word_example",
      "valid": true,
      "explanation": "title:kabelkanal title:kabel title:kanal"
}

in version 2.0.2, which means the terms are ORed. That is also expected, given that the decompounder puts them in the same position, which basically means all three variants are treated as synonyms. If they appeared in different token positions, the "operator" would take effect.
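To make the position argument concrete, here is a toy Python model. It is only a sketch and not Lucene's actual query builder: the function name build_query is made up for illustration, and the token positions for "kabel kanal" are an assumption (the thread only shows the "_analyze" output for "kabelkanal"). Terms that share a position are grouped into one OR group, and the operator only joins groups from different positions.

```python
from itertools import groupby


def build_query(tokens, operator="or"):
    """Toy model: tokens is a list of (term, position) pairs as reported by _analyze."""
    ordered = sorted(tokens, key=lambda t: t[1])
    groups = []
    for _, same_position in groupby(ordered, key=lambda t: t[1]):
        terms = sorted(term for term, _ in same_position)
        # Terms at the same position behave like synonyms: any one of them may match.
        groups.append(terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")")
    # The operator only decides how groups from different positions are combined.
    joiner = " AND " if operator.lower() == "and" else " OR "
    return joiner.join(groups)


# Positions taken from the _analyze output above: all three tokens sit at position 0.
compound = [("kabelkanal", 0), ("kabel", 0), ("kanal", 0)]
# Assumed positions for the two-word query "kabel kanal".
two_words = [("kabel", 0), ("kanal", 1)]

print(build_query(compound, "and"))   # (kabel OR kabelkanal OR kanal)
print(build_query(two_words, "and"))  # kabel AND kanal
```

In this model the "and" operator has nothing to join for "kabelkanal", because there is only one position, which would match what the explain output shows.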

Just for comparison: in 7.12 "_analyze" gives me the same as above, and the "_validate/query" endpoint's explain option gives me

{
      "index" : "compound_word_example",
      "valid" : true,
      "explanation" : "Synonym(title:kabel title:kabelkanal title:kanal)"
}

which is also expected behaviour. The result list seems similar on the small toy dataset you provided on both versions.

So my question really is if this example is supposed to show some change in behaviour that you were seeing in 2.0 and that is somewhat different now, thus preventing your upgrade, or if this is some additional use case or something that was solved differently before. Could you elaborate a bit on that?

@cbuescher
Hi Christoph
Thanks for your answer. The decompound output was just to show that the decompounding works as expected.

The problem seems to be with the query.

In our ES 2.0 solution, a multi_match with cross_fields is used with a minimum_should_match of -45%.

After trying the example on ES 2.0, I noticed that Antennenkabel is also found here.

curl -XPOST "http://localhost:9210/compound_word_example/test/_search?pretty" -H 'Content-Type: application/json' -d'{  "query": {    "match": {      "title": {"query": "kabelkanal", "operator": "and"}    }  }}'

returns

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 1.9269364,
    "hits" : [ {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "2",
      "_score" : 1.9269364,
      "_source":{"title": "Kabelkanal Blau"}
    }, {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "3",
      "_score" : 1.9269364,
      "_source":{"title": "Kabelkanal Grün"}
    }, {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "4",
      "_score" : 0.53781134,
      "_source":{"title": "Antennenkabel"}
    }, {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "1",
      "_score" : 0.33613208,
      "_source":{"title": "Antennen-Kabel"}
    }, {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "5",
      "_score" : 0.26890567,
      "_source":{"title": "Antennen-Kabel alu"}
    } ]
  }
}

Then I added "minimum_should_match":"-45%"

curl -XPOST "http://localhost:9210/compound_word_example/test/_search?pretty" -H 'Content-Type: application/json' -d'{  "query": {    "match": {      "title": {"query": "kabelkanal", "operator": "and", "minimum_should_match":"-45%" }    }  }}'

With this, the expected result is returned

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.9269364,
    "hits" : [ {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "2",
      "_score" : 1.9269364,
      "_source":{"title": "Kabelkanal Blau"}
    }, {
      "_index" : "compound_word_example",
      "_type" : "test",
      "_id" : "3",
      "_score" : 1.9269364,
      "_source":{"title": "Kabelkanal Grün"}
    } ]
  }
}
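A small Python sketch of why -45% could produce exactly these two hits on 2.0. It assumes the 2.0 query really is the three-clause query from the validate output above (title:kabelkanal title:kabel title:kanal), and it uses the documented rule for negative percentages: that share of the optional clauses may be missing, rounded down. The table of which terms each title matches is inferred from the hit lists in this thread, not from "_analyze" output, and the names required_clauses and MATCHED_TERMS are made up for illustration.

```python
def required_clauses(clause_count, negative_percent):
    """Negative percentage: that share of the clauses may be missing, rounded down."""
    may_be_missing = clause_count * abs(negative_percent) // 100
    return clause_count - may_be_missing


QUERY_TERMS = {"kabelkanal", "kabel", "kanal"}

# Assumption: query terms matched per document, inferred from the results above.
MATCHED_TERMS = {
    1: {"kabel"},                         # Antennen-Kabel
    2: {"kabelkanal", "kabel", "kanal"},  # Kabelkanal Blau
    3: {"kabelkanal", "kabel", "kanal"},  # Kabelkanal Grün
    4: {"kabel"},                         # Antennenkabel
    5: {"kabel"},                         # Antennen-Kabel alu
}

needed = required_clauses(len(QUERY_TERMS), -45)  # 3 - floor(1.35) = 2
hits = sorted(
    doc_id
    for doc_id, terms in MATCHED_TERMS.items()
    if len(terms & QUERY_TERMS) >= needed
)
print(needed, hits)  # 2 [2, 3]
```

Under these assumptions a document needs at least two of the three terms, so the titles that only contain kabel drop out.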

But when executing the same query on ES 7.x

POST /compound_word_example/_search
{
  "query": {
    "match": {
      "title": {"query": "kabelkanal", "operator": "and", "minimum_should_match": "-45%"}
    }
  }
}

the result with the unexpected Antennenkabel

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0057728,
    "hits" : [
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0057728,
        "_source" : {
          "title" : "Kabelkanal Blau"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0057728,
        "_source" : {
          "title" : "Kabelkanal Grün"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.8161402,
        "_source" : {
          "title" : "Antennenkabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68392545,
        "_source" : {
          "title" : "Antennen-Kabel"
        }
      },
      {
        "_index" : "compound_word_example",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.58857614,
        "_source" : {
          "title" : "Antennen-Kabel alu"
        }
      }
    ]
  }
}

is returned.

It seems as if minimum_should_match is not considered. The value can be changed at will without affecting the result.
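One possible reading, which is not confirmed anywhere in this thread: if 7.x treats the Synonym(...) group from the explain output as a single clause, there is nothing left for minimum_should_match to count. The Python sketch below applies the documented rule for negative percentages (that share of the clauses may be missing, rounded down) to three clauses and to one clause. The helper name required_clauses is hypothetical.

```python
def required_clauses(clause_count, spec):
    """Resolve a negative percentage such as '-45%' to a number of required clauses."""
    percent = int(spec.rstrip("%"))
    may_be_missing = clause_count * abs(percent) // 100  # rounded down
    return clause_count - may_be_missing


for spec in ("-10%", "-45%", "-90%"):
    # First number: three separate clauses (2.0 style), second: one synonym clause.
    print(spec, required_clauses(3, spec), required_clauses(1, spec))
# -10% 3 1
# -45% 2 1
# -90% 1 1
```

With three clauses the value changes how many terms must match. With a single clause every setting resolves to 1, which would be consistent with the value having no effect on the 7.x result.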