Multi match query with custom analyzer and 'and' operator

My use case is to search address fields in a way that is insensitive to non-ASCII characters (diacritics).
Here is my mapping:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "ascii_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folding"
          ]
        }
      }
    },
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "City": {
          "type": "text",
          "analyzer": "ascii_analyzer"
        },
        "County": {
          "type": "text",
          "analyzer": "ascii_analyzer"
        },
        "PostCode": {
          "type": "text"
        }
      }
    }
  }
}

I am adding a document:

PUT test/_doc/1
{
  "City": "Wrocław",
  "County": "Dolnośląskie",
  "PostCode": "53900"
}

Here is my query. I want it to return documents that contain all of the entered words across any of the fields, but it returns no results:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "wroclaw dolnoslaskie 53900",
      "operator": "and",
      "type": "cross_fields",
      "fields": [
        "City",
        "County",
        "PostCode"
      ]
    }
  }
}

It works when the operator is 'or'. It also works when I omit PostCode, or when I set PostCode to use ascii_analyzer as well (although that does not make sense).

Hey,

thanks a bunch for the complete example, this makes things so easy to understand! Minor nit: Specifying the Elasticsearch version would help a lot.

So let's take this for a spin. After creating the index, we can run the _analyze API to understand what is stored in the inverted index.

GET test/_analyze
{
  "text": [ "Wrocław", "Dolnośląskie", "53900" ],
  "analyzer": "ascii_analyzer"
}

The response is:

{
  "tokens" : [
    {
      "token" : "wroclaw",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wrocław",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dolnoslaskie",
      "start_offset" : 8,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 101
    },
    {
      "token" : "dolnośląskie",
      "start_offset" : 8,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 101
    },
    {
      "token" : "53900",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "<NUM>",
      "position" : 202
    }
  ]
}

This looks good: it means that wroclaw and dolnoslaskie, without the special characters, are put into the inverted index alongside the original forms.
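
As a side check, suggested here rather than something run in this thread: _analyze also accepts a field name instead of an analyzer name, in which case it applies whatever analyzer the mapping assigns to that field. PostCode has no analyzer configured in the mapping above, so it should fall back to the standard analyzer, which lowercases but does not fold diacritics.

```
GET test/_analyze
{
  "field": "PostCode",
  "text": "Wrocław"
}
```

If that assumption holds, the response should contain only the token wrocław, without the folded wroclaw variant that City and County produce.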

So, maybe the query is the culprit? Let's use the explain API to find out more:

GET test/_explain/1
{
  "query": {
    "multi_match": {
      "query": "wroclaw dolnoslaskie 53900",
      "type": "cross_fields",
      "operator": "and", 
      "fields": [
        "City",
        "County",
        "PostCode"
      ]
    }
  }
}

returns

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "No matching clause",
    "details" : [ ]
  }
}

All right, so apparently no clause of the query matches. Let's use the validate API to check which queries are created.

GET test/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "wroclaw dolnoslaskie 53900",
      "type": "cross_fields",
      "operator" : "and",
      "fields": [
        "City",
        "County",
        "PostCode"
      ]
    }
  }
}

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test",
      "valid" : true,
      "explanation" : "((+PostCode:wroclaw +PostCode:dolnoslaskie +PostCode:53900) | (+(City:wroclaw | County:wroclaw) +(City:dolnoslaskie | County:dolnoslaskie) +(City:53900 | County:53900)))"
    }
  ]
}

For comparison, here is the same request without the operator, which then defaults to 'or':

GET test/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "wroclaw dolnoslaskie 53900",
      "type": "cross_fields",
      "fields": [
        "City",
        "County",
        "PostCode"
      ]
    }
  }
}

returns

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test",
      "valid" : true,
      "explanation" : "((PostCode:wroclaw PostCode:dolnoslaskie PostCode:53900) | ((City:wroclaw | County:wroclaw) (City:dolnoslaskie | County:dolnoslaskie) (City:53900 | County:53900)))"
    }
  ]
}

OK, so this sheds some light on why the first query does not match. The query that gets constructed, ((+PostCode:wroclaw +PostCode:dolnoslaskie +PostCode:53900) | (+(City:wroclaw | County:wroclaw) +(City:dolnoslaskie | County:dolnoslaskie) +(City:53900 | County:53900))), cannot match this document: the first branch requires all three terms in PostCode, and the second requires each term in City or County, where 53900 is not indexed. What I cannot tell you off the top of my head is why exactly the query is constructed this way.

I found why it behaves like that:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#cross-field-analysis

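It seems that with 'cross_fields', fields that use different analyzers are split into one group per analyzer, and with the 'and' operator all terms must then match within a single group. Here PostCode forms its own group, so no group contains all three terms.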
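
The same documentation section notes that all fields can be forced into one group by setting the analyzer parameter on the query itself. A minimal sketch of that, assuming ascii_analyzer is acceptable as the search-time analyzer for PostCode as well (not tested against the index in this thread):

```
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "wroclaw dolnoslaskie 53900",
      "type": "cross_fields",
      "operator": "and",
      "analyzer": "ascii_analyzer",
      "fields": [
        "City",
        "County",
        "PostCode"
      ]
    }
  }
}
```

Running this request through _validate/query?rewrite=true, as above, should then show a single group in which each term may match any of the three fields. Unlike changing the mapping, this leaves PostCode indexed with its default analyzer.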
