Why is Elasticsearch not using analyzer tokens during search?

index/_analyze

    {
      "analyzer": "autocomplete",
      "field": "name",
      "text": "жаб"
    }

gives me the correct tokens:

    {
        "tokens": [{
            "token": "žab",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        }]
    }

Now when I plug the token žab into the search, it works and gives me results.
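
For reference, this is the query that does return the document, using the already-folded token:

index/_search

    {
      "query": {
        "match": {
          "name": "žab"
        }
      }
    }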

Explain query:

index/_validate/query?explain

    {
      "query": {
        "match": {
          "name": "жаб"
        }
      }
    }

is giving me

    {
        "_shards": {
            "total": 1,
            "successful": 1,
            "failed": 0
        },
        "valid": true,
        "explanations": [
            {
                "index": "places_for_search",
                "valid": true,
                "explanation": "name:žab"
            }
        ]
    }

So the query string is being converted into žab.

But when I actually search with


    {
      "query": {
        "match": {
          "name": "жаб"
        }
      }
    }

it does not return any results, even though it should, since žab does.

I even tried forcing the analyzer like this:

    {
      "query": {
        "multi_match": {
          "query": "жаб",
          "type": "bool_prefix",
          "fields": [
            "name"
          ],
          "analyzer": "autocomplete"
        }
      }
    }

but still, no results.

For reference, my document looks like this:

    {
       "id": "ChIJT3DV8zM5TRMRlVS4y79AH7A",
        "name": "Žabljak"
    }

The name field uses the autocomplete analyzer:

    {
        "type": "search_as_you_type",
        "doc_values": false,
        "max_shingle_size": 3,
        "analyzer": "autocomplete"
    }
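
One thing worth double-checking with search_as_you_type is that it also generates shingle sub-fields (name._2gram and name._3gram with the default naming, given the max_shingle_size of 3), and a bool_prefix multi_match is normally run over the base field and those sub-fields, along the lines of:

    {
      "query": {
        "multi_match": {
          "query": "жаб",
          "type": "bool_prefix",
          "fields": [
            "name",
            "name._2gram",
            "name._3gram"
          ]
        }
      }
    }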

My analyzer:

    {
      "analyzer": {
        "autocomplete": {
          "filter": [
            "autocomplete",
            "trim",
            "asciifolding",
            "lowercase",
            "serbian_stemmer",
            "russian_stemmer"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }

My filters:

    {
      "filter": {
        "russian_stemmer": {
          "type": "stemmer",
          "language": "russian"
        },
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": "3",
          "max_gram": "15"
        },
        "serbian_stemmer": {
          "type": "stemmer",
          "language": "serbian"
        }
      }
    }

So, bottom line: the analyzer works, but its tokens are not being used in the query. Any idea how to debug this further? I am using Elasticsearch version 8.13.3.
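
To make "debug this further" concrete, two things worth comparing (a sketch, reusing the document id and field from the example above) are the tokens that were actually indexed for the document, via the term vectors API, and the tokens produced by index-time analysis of the stored value:

    GET index/_termvectors/ChIJT3DV8zM5TRMRlVS4y79AH7A?fields=name

    GET index/_analyze
    {
      "field": "name",
      "text": "Žabljak"
    }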

I'm still not sure what exactly goes wrong with the stemmer token filters in the original setup; that probably needs an Elasticsearch expert to explain. However, after a lot of trial and error, I've ended up with an analyzer configuration that solves the problem.

POST index/_settings

    {
      "analysis": {
        "filter": {
          "russian_stemmer": {
            "type": "snowball",
            "language": "Russian"
          },
          "searbian_stemmer": {
            "type": "snowball",
            "language": "Russian"
          },
          "cyrillic_to_latin": {
            "type": "icu_transform",
            "id": "Any-Latin; Latin-ASCII"
          },
          "autocomplete": {
            "type": "edge_ngram",
            "min_gram": "3",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "cyrillic_stemmed_transliterator": {
            "filter": [
              "autocomplete",
              "lowercase",
              "cyrillic_to_latin",
              "russian_stemmer",
              "searbian_stemmer",
              "lowercase",
              "asciifolding"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
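
Two practical notes on applying this: the cyrillic_to_latin filter uses icu_transform, which is part of the analysis-icu plugin (installed with bin/elasticsearch-plugin install analysis-icu plus a node restart), and analyzer settings can only be updated on a closed index, so on an existing index the settings change has to be wrapped roughly like this:

    POST index/_close
    # apply the _settings request above while the index is closed
    POST index/_open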

POST index/_mappings

    {
      "properties": {
        "name": {
          "type": "search_as_you_type",
          "analyzer": "cyrillic_stemmed_transliterator"
        }
      }
    }
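
One caveat: Elasticsearch does not allow changing the analyzer of a field that already exists in the mapping, so if name was already mapped with the old analyzer you will most likely need to create a new index with the settings and mapping above and reindex into it, roughly like this (places_for_search_v2 is just a placeholder name):

    POST _reindex
    {
      "source": { "index": "places_for_search" },
      "dest": { "index": "places_for_search_v2" }
    }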

Now if you have a document like

    {
       "id": "ChIJT3DV8zM5TRMRlVS4y79AH7A",
        "name": "Žabljak"
    }

and you run your search as follows (it does not matter whether you type zab, žab, or жаб as the query):

    {
      "query": {
        "multi_match": {
          "query": "жаб", // no matter what you put in here, zab or žab, or жаб
          "type": "bool_prefix",
          "fields": [
            "name"
          ]
        }
      }
    }

you will get the correct document back.
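
A quick way to verify the folding is to run all three spellings through the new analyzer (the _analyze API accepts a list of strings) and check that they produce the same tokens:

    GET index/_analyze
    {
      "analyzer": "cyrillic_stemmed_transliterator",
      "text": ["жаб", "žab", "zab"]
    }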

Purpose of this answer and the question above:

If you have property names stored in Latin characters but want to search for them with Cyrillic or plain ASCII input, you need a way to bridge the two character sets at analysis time, which is what the configuration above does.
