Multimatch with CROSS_FIELD query and decompounder

The problem was also discussed by @singer and @hbruch

Here is an example showing that the search does not work as expected:

Create an index:

PUT example/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "decomp_de": {
          "type": "hyphenation_decompounder",
          "word_list": ["kaffee", "tasse", "tüte"],
          "hyphenation_patterns_path": "hyph/de_DR.xml",
          "min_subword_size": 3,
          "only_longest_match": true
        }
      },
      "analyzer": {
        "german_analyzer": {
          "filter": [
            "lowercase",
            "decomp_de",
            "unique"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "german_analyzer",
        "norms": false
      },
      "type": {
        "type": "text",
        "analyzer": "german_analyzer",
        "norms": false
      }
    }
  }
}

Index documents

POST example/_bulk
{"index":{"_id":1}}
{"text": "Kaffeetasse", "type":"Tasse"}
{"index":{"_id":2}}
{"text": "Kaffeetüte", "type": "Tüte"}

Search for Kaffeetasse:

GET example/_search
{
  "query": {
    "multi_match": {
      "query": "Kaffeetasse",
      "fields": [
        "text",
        "type"
      ],
      "type": "cross_fields",
      "operator": "and",
      "slop": 1,
      "prefix_length": 0,
      "max_expansions": 50,
      "zero_terms_query": "none",
      "auto_generate_synonyms_phrase_query": "true",
      "fuzzy_transpositions": false,
      "boost": 1
    }
  }
}

The result is unexpected, and it does not matter which type, operator, or minimum_should_match is given:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "text" : "Kaffeetasse",
          "type" : "Tasse"
        }
      },
      {
        "_index" : "example",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.25069216,
        "_source" : {
          "text" : "Kaffeetüte",
          "type" : "Tüte"
        }
      }
    ]
  }
}

The expected result is only the document containing Kaffeetasse.

Analyzing:

GET example/_analyze
{
  "analyzer": "german_analyzer"
  , "text": "Kaffeetasse"
}

produces

{
  "tokens" : [
    {
      "token" : "kaffeetasse",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "kaffee",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "tasse",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
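For comparison, analyzing the text of the second document (an assumption, based on the same word list) should produce kaffeetüte, kaffee, and tüte, all at position 0. Both documents therefore share the token kaffee, which is why the search for Kaffeetasse above also matches document 2:

GET example/_analyze
{
  "analyzer": "german_analyzer",
  "text": "Kaffeetüte"
}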

And IMHO the query should be rewritten to (text:kaffeetasse OR (text:kaffee AND text:tasse)) OR (type:kaffeetasse OR (type:kaffee AND type:tasse)), and not to (text:kaffeetasse OR text:kaffee OR text:tasse OR type:kaffeetasse OR type:kaffee OR type:tasse).
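As a hand-written sketch of that desired rewrite (using term queries on the already lowercased tokens, so query-time analysis is bypassed), the following bool query should match only document 1:

GET example/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "kaffeetasse" } },
        { "bool": { "must": [
          { "term": { "text": "kaffee" } },
          { "term": { "text": "tasse" } }
        ]}},
        { "term": { "type": "kaffeetasse" } },
        { "bool": { "must": [
          { "term": { "type": "kaffee" } },
          { "term": { "type": "tasse" } }
        ]}}
      ],
      "minimum_should_match": 1
    }
  }
}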

With the _validate/query API

GET example/_validate/query?explain=true&rewrite=true
{
  "query": {
    "multi_match": {
      "query": "Kaffeetasse",
      "fields": [
        "text",
        "type"
      ],
      "type": "cross_fields",
      "operator": "and",
      "slop": 0,
      "prefix_length": 0,
      "max_expansions": 50,
      "minimum_should_match": "-45%",
      "zero_terms_query": "NONE",
      "auto_generate_synonyms_phrase_query": "false",
      "fuzzy_transpositions": false,
      "boost": 1
    }
  }
}

I got:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "example",
      "valid" : true,
      "explanation" : "(text:kaffeetasse | text:kaffee | text:tasse | type:kaffeetasse | type:kaffee | type:tasse)"
    }
  ]
}

or, without the rewrite parameter:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "example",
      "valid" : true,
      "explanation" : "blended(terms:[text:kaffeetasse, text:kaffee, text:tasse, type:kaffeetasse, type:kaffee, type:tasse])"
    }
  ]
}

Searching for Kaffee should return both documents; searching for Kaffeetasse or tasse should return document 1; and searching for Kaffeetüte or tüte should return document 2. Or what am I misunderstanding?

Based on the example shown above, searching for kaffee tasse returns the expected result:

GET example/_search
{
  "query": {
    "multi_match": {
      "query": "Kaffee tasse",
      "fields": [
        "text",
        "type",
        "text.raw"
      ],
      "type": "cross_fields",
      "operator": "and",
      "slop": 1,
      "prefix_length": 0,
      "max_expansions": 50,
      "zero_terms_query": "none",
      "auto_generate_synonyms_phrase_query": "false",
      "minimum_should_match": "-66%", 
      "fuzzy_transpositions": false,
      "boost": 1
    }
  }
}

produces

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2037694,
    "hits" : [
      {
        "_index" : "example",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2037694,
        "_source" : {
          "text" : "Kaffeetasse",
          "type" : "Tasse"
        }
      }
    ]
  }
}

If I understand it correctly, decompounding is intended to ensure that searching for kaffeetasse or for kaffee tasse returns the same results, or am I wrong?
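For what it's worth, the single-field case looks similar (a sketch, assuming the index above): the decompounded tokens all sit at position 0 and are therefore treated like synonyms, so even a plain match query with operator and has nothing to AND together, and presumably matches both documents as well:

GET example/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Kaffeetasse",
        "operator": "and"
      }
    }
  }
}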
