Token filter doesn't work on wildcard-search

Hi

I'm trying to implement a filter that matches compound words, and I need to combine it with wildcard search, but it doesn't work.

Example case:
There are documents with the names "Coca-Cola", "CocaCola" and "Coca Cola". I want all of them to be hits when searching for any one of these names.

My mappings:

  • "whitespace" to create a token for each word
  • "shingle" to combine adjacent words into compounds
  • "pattern_replace" to remove spaces, hyphens etc.

PUT /testindex
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "test_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "shingle",
            "remove_special_characters_filter"
          ]
        }
      },
      "filter": {
        "remove_special_characters_filter": {
          "type": "pattern_replace",
          "pattern": """[^\p{L}\p{Nd}]""",
          "replacement": ""
        }
      }
    }
  }
}
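
To make clear what I expect this analyzer to produce, here is a rough Python emulation of the chain (whitespace tokenizer, lowercase, shingle, pattern_replace). It is only a sketch: the function names are mine, and I am assuming the shingle filter's defaults are what I understand them to be (unigrams plus two-word shingles joined by a space). The real tokens can be checked with the `_analyze` API.

```python
import unicodedata


def whitespace_tokenize(text):
    # Stand-in for the "whitespace" tokenizer: split on whitespace only.
    return text.split()


def shingle(tokens, size=2, sep=" "):
    # Assumed shingle defaults: keep each unigram and add two-word shingles.
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + size <= len(tokens):
            out.append(sep.join(tokens[i:i + size]))
    return out


def remove_special_characters(token):
    # Mirrors the pattern [^\p{L}\p{Nd}]: keep only letters and decimal digits.
    return "".join(
        ch for ch in token
        if unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
    )


def analyze(text):
    # Same order as in test_analyzer: lowercase, shingle, pattern_replace.
    tokens = [t.lower() for t in whitespace_tokenize(text)]
    return [remove_special_characters(t) for t in shingle(tokens)]


for name in ("Coca-Cola", "CocaCola", "Coca Cola"):
    print(name, "->", analyze(name))
```

If this emulation is right, every document ends up with the token "cocacola", which would explain why the match query below finds all three.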

Creating documents:

PUT /testindex/_doc/1
{
  "name": "Coca-Cola"
}
PUT /testindex/_doc/2
{
  "name": "CocaCola"
}
PUT /testindex/_doc/3
{
  "name": "Coca Cola"
}

Searching:

A match query hits all documents:

GET /testindex/_search
{
  "query": {
    "match": {
      "name": "Coca-Cola"
    }
  }
}

A wildcard query hits no documents (and neither does the same value without the wildcards):

GET /testindex/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*Coca-Cola*"
      }
    }
  }
}

Removing the hyphen from the wildcard value hits all documents:

GET /testindex/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*CocaCola*"
      }
    }
  }
}
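
My guess, and it is only a guess, is that the wildcard value is lowercased but not run through the pattern_replace filter, so the hyphen stays in the pattern and is compared against terms that no longer contain one. The sketch below imitates that with Python's fnmatch. The `indexed_terms` dictionary and the `wildcard_hits` function are made up for illustration, and the terms are what I expect the analyzer to have indexed for each document.

```python
from fnmatch import fnmatchcase

# Terms I expect to be indexed per document id (assumption, not verified).
indexed_terms = {
    1: ["cocacola"],                  # "Coca-Cola"
    2: ["cocacola"],                  # "CocaCola"
    3: ["coca", "cocacola", "cola"],  # "Coca Cola"
}


def wildcard_hits(pattern):
    # Assumption: the pattern is only lowercased, nothing else is applied.
    pattern = pattern.lower()
    return sorted(
        doc_id
        for doc_id, terms in indexed_terms.items()
        if any(fnmatchcase(term, pattern) for term in terms)
    )


print(wildcard_hits("*Coca-Cola*"))  # hyphen never matches an indexed term
print(wildcard_hits("*CocaCola*"))   # matches "cocacola" in every document
```

This reproduces what I see: no hits with the hyphen, all three documents without it.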

Any suggestions?
