Search as you type for documents with digits, Unicode and special characters

Hi! I'm very new to Elasticsearch, and while learning and playing around with it I got stuck on a problem that I'm not sure how to solve.

REQUIREMENT

I'm trying to build a search-as-you-type autocomplete. I have an index with a single field, "name".

Here is a list of example documents I have:

  1. O/Purist Tsipouro
  2. Southside
  3. South Side
  4. Bénédictine
  5. Piña Colada
  6. Bee's Knees
  7. Beer
  8. 49th Street
  9. 49 Warriors
  10. 3 Dolla

A few requirements for how I want it to work, as input -> expected output examples:

  • ["O/P", "o / pu", "o/p"] -> ["O/Purist Tsipouro", …]
  • ["Southside"] -> ["Southside", "South Side", …]
  • ["South side"] -> ["South Side", "Southside", …]
  • ["benedict", "ben", "Bénédictine"] -> ["Bénédictine", …]
  • ["Bee's", "bees"] -> ["Bee's Knees"]

CURRENT SETUP

Here are my settings:

settings: {
  analysis: {
    char_filter: {
      "my_char_filter": {
        type: "mapping",
        mappings: [
          "' => ",
          "’ => "
        ]
      }
    },
    filter: {
      "my_word_delimiter": {
        type: "word_delimiter"
      }
    },
    analyzer: {
      "autocomplete": {
        type: "custom",
        tokenizer: "autocomplete_tokenizer",
        char_filter: ["my_char_filter"],
        filter: ["lowercase", "asciifolding", "my_word_delimiter"]
      }
    },
    tokenizer: {
      "autocomplete_tokenizer": {
        type: "edge_ngram",
        min_gram: 1,
        max_gram: 20,
        token_chars: ["letter", "digit"]
      }
    }
  }
},
mappings: {
  properties: {
    name: { type: "text", analyzer: "autocomplete" }
  }
}

Here is my query:

query: {
  match: {
    name: {
      query: search_query,
      analyzer: "autocomplete",
      boost: 1
    }
  }
}

PROBLEM

It works great until I index anything with digits, like "3 Dolla" or "49th Street".

If I do, I get this error:

{"type"=>"illegal_argument_exception", "reason"=>"startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=2,lastStartOffset=2 for field 'name'"} on item with id …

If I understand correctly, the problem is with the edge_ngram tokenizer (apparently it doesn't play well with the word_delimiter filter, which re-splits the n-gram tokens and produces out-of-order offsets). I've tried moving edge_ngram from the tokenizer into a token filter, and that resolves the error, but then the quality of the search results is simply terrible.
I would appreciate it if someone could help me fix this or point me in the right direction in the docs/Google.

I'm super lost and stuck.
Thank you

Hi @zdebyman

Perhaps this example could be a good place to start. You say you want search-as-you-type autocomplete, but I don't see the search_as_you_type field type in your mapping.

The example below uses your char filter + a search_as_you_type field. Maybe with some changes you can already get the expected results.

Mapping:

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "' => ",
            "’ => "
          ]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "search_as_you_type",
        "analyzer": "autocomplete"
      }
    }
  }
}
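Before searching, you can verify the char filter, lowercase and asciifolding with the _analyze API. With the index above, "Bénédictine" should come back as the token benedictine, and "Bee's" as bees:

GET test/_analyze
{
  "analyzer": "autocomplete",
  "text": ["Bénédictine", "Bee's Knees"]
}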

Documents:

POST test/_bulk
{"index":{}}
{"my_field":"O/Purist Tsipouro"}
{"index":{}}
{"my_field":"Southside"}
{"index":{}}
{"my_field":"South Side"}
{"index":{}}
{"my_field":"Bénédictine"}
{"index":{}}
{"my_field":"Piña Colada"}
{"index":{}}
{"my_field":"Bee's Knees"}
{"index":{}}
{"my_field":"Beer"}
{"index":{}}
{"my_field":"49th Street"}
{"index":{}}
{"my_field":"49 Warriors"}
{"index":{}}
{"my_field":"3 Dolla"}

Query:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "bee's",
      "type": "bool_prefix",
      "fields": [
        "my_field",
        "my_field._2gram",
        "my_field._3gram"
      ]
    }
  }
}
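The slash and digit examples should work with the same query shape. For instance, this should return "O/Purist Tsipouro", since the standard tokenizer splits on the slash and bool_prefix treats the last term as a prefix:

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "o/p",
      "type": "bool_prefix",
      "fields": [
        "my_field",
        "my_field._2gram",
        "my_field._3gram"
      ]
    }
  }
}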

This is great! I added a word_delimiter filter and it pretty much did what I want. Thanks!
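For anyone finding this later, what I ended up with is roughly the following (from memory, so details may differ): the mapping above with my_word_delimiter added to the analyzer chain. preserve_original keeps the unsplit token alongside the split parts:

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["' => ", "’ => "]
        }
      },
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"],
          "filter": ["lowercase", "asciifolding", "my_word_delimiter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "search_as_you_type",
        "analyzer": "autocomplete"
      }
    }
  }
}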

