Edge NGram Tokenizer not Tokenizing Digits?

We're using the edge_ngram tokenizer (Elasticsearch 7.3 on Windows) on letters and digits, with grams from 3 to 10 characters. However, when we index a document containing "test 11kw" and search for text:test AND text:11k, we find nothing.

Interestingly, even the full token finds nothing: text:test AND text:11kw

Indexing {"text":"test kw11"} and searching for text:test AND text:kw11 also returns no results.

However, with {"text":"test kwak"}, searching for text:test AND text:kwa DOES return results!

PUT /myindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": ["lowercase"]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    },
    "mappings": {
        "dynamic": false,
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "autocomplete_search"
            }
        }
    }
}

Upload a document:

PUT /myindex/_bulk
{"index":{"_id":"test11"}}
{"text":"test 11kw"}`

Search:

GET /myindex/_search?q=text:test AND text:11kw

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

When we analyze the string "test 11KW" we see it is tokenized as expected:

GET /myindex/_analyze
{
  "analyzer": "autocomplete",
  "text": "test 11KW"
}

{
    "tokens": [
        {
            "token": "tes",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "test",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "11k",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 2
        },
        {
            "token": "11kw",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 3
        }
    ]
}

Now have a look at what the search analyzer produces:

GET /myindex/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "test 11KW"
}

It gives:

{
  "tokens" : [
    {
      "token" : "test",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "kw",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    }
  ]
}

The token "test" is found, but "kw" is not: the search analyzer has stripped the digits, so "kw" never matches the indexed grams "11k" and "11kw".
The lowercase tokenizer simply discards everything that is not a letter. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenizer.html
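
You can see the same behavior with the built-in lowercase tokenizer directly, without the custom analyzer in between:

GET /_analyze
{
  "tokenizer": "lowercase",
  "text": "test 11KW"
}

This produces the same two tokens, test and kw; the digits are dropped before the query ever reaches the index.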

The solution seems to be to change the search_analyzer to standard, somewhat contrary to the edge_ngram docs example, which uses the autocomplete_search search analyzer (lowercase tokenizer). The standard analyzer keeps digits and lowercases its input, so a query for 11KW produces the token 11kw, which matches the indexed gram.

So the mapping is:

    "text": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
    }
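
For completeness, here is a minimal sketch of the corrected index definition, assuming you delete and recreate the index (the analysis settings are unchanged; only the search_analyzer differs):

PUT /myindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    },
    "mappings": {
        "dynamic": false,
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            }
        }
    }
}

After re-indexing the document, the original query should now match: the standard analyzer turns "test 11KW" into the tokens test and 11kw, both of which exist in the index as edge n-grams. Note that search_analyzer is also one of the few mapping parameters Elasticsearch allows updating on an existing field, so you may be able to apply this with the update mapping API instead of recreating the index.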
