Hunspell for Russian: problem with some grammatical cases

I use Elasticsearch 7.17.4 in Docker with the hunspell token filter for Russian.
My index analysis settings:

"analysis": {
        "filter": {
          "my_stemmer": {
            "type": "stemmer",
            "language": "russian"
          },
          "ru_RU": {
            "locale": "ru_RU",
            "type": "hunspell"
          }
        },
        "analyzer": {
          "custom_analyzer": {
            "filter": [
              "lowercase",
              "ru_RU",
              "my_stemmer"
            ],
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }
        }
      }

Unfortunately, full-text search does not work properly for all word forms. For example, I have the following entries:
новый колодец
нового колодца
новому колодцу
новым колодцем
новом колодце

When I make the following request:

GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json

{
  "query": {
    "query_string": {
      "query": "колодец",
      "default_field": "name"
    }
  }
}

I get the following result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.8472799,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "35",
        "_score": 1.8472799,
        "_source": {
          "name": "новый колодец",
          "id": 35,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "36",
        "_score": 1.8472799,
        "_source": {
          "name": "нового колодца",
          "id": 36,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "37",
        "_score": 1.8472799,
        "_source": {
          "name": "новому колодцу",
          "id": 37,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "39",
        "_score": 1.8472799,
        "_source": {
          "name": "новом колодце",
          "id": 39,
          "_meta": {}
        }
      }
    ]
  }
}


Response code: 200 (OK); Time: 65ms; Content length: 1235 bytes

The entry "новым колодцем" is never returned.
My question: is this a problem with my hunspell dictionary? Should I look for a version of it with more data? Or is it a problem with my settings? Maybe I missed something?

Hi @GrigoriyKrasovskiy

I would inspect the tokens that custom_analyzer generates for each input term, for example with the _analyze API. That way you can be sure it produces the tokens you expect.
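
As a minimal sketch of that check, assuming the index is named ingredient and the analyzer custom_analyzer as in the question, a request like this returns the tokens produced for the entry that never matches. With "explain": true the response should also be broken down per token filter, which would show whether the hunspell filter or the stemmer is where "колодцем" diverges from "колодец".

```
GET http://localhost:9200/ingredient/_analyze?pretty
Content-Type: application/json

{
  "analyzer": "custom_analyzer",
  "text": "новым колодцем",
  "explain": true
}
```

Running the same request with "колодец" as the text and comparing the two token lists shows whether both forms reduce to the same term. If they do not, the query cannot match that entry.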

Also, I saw this note in the documentation:

If available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter. In practice, algorithmic stemmers often outperform dictionary stemmers. See dictionary stemmers.
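
One way to compare the two approaches without changing the index is to pass an ad hoc filter chain to the _analyze API. This is only a sketch: the inline stemmer definition mirrors my_stemmer from the question, and I have not checked which tokens it produces for these words.

```
GET http://localhost:9200/_analyze?pretty
Content-Type: application/json

{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "russian" }
  ],
  "text": "колодец колодцем"
}
```

Swapping the inline stemmer for { "type": "hunspell", "locale": "ru_RU" } should test the dictionary filter on its own in the same way, assuming the ru_RU dictionary is installed on the node.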

Have you tried using fuzziness to get the fifth document in your results?

{
  "query": {
    "query_string": {
      "query": "колодец~",
      "default_field": "name",
      "fuzziness": 1
    }
  }
}
