Hunspell Russian language. Problem with some grammatical cases

GrigoriyKrasovskiy · June 18, 2022, 5:16pm

I use Elasticsearch 7.17.4 in Docker with the hunspell token filter for the Russian language.
My index analysis settings:

"analysis": {
  "filter": {
    "my_stemmer": {
      "type": "stemmer",
      "language": "russian"
    },
    "ru_RU": {
      "locale": "ru_RU",
      "type": "hunspell"
    }
  },
  "analyzer": {
    "custom_analyzer": {
      "filter": [
        "lowercase",
        "ru_RU",
        "my_stemmer"
      ],
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard"
    }
  }
}

Unfortunately, full-text search does not work properly for all words. For example, I have the following entries:
новый колодец
нового колодца
новому колодцу
новым колодцем
новом колодце

When I make the following request:

GET http://localhost:9200/ingredient/_search?pretty
Content-Type: application/json

{
  "query": {
    "query_string": {
      "query": "колодец",
      "default_field": "name"
    }
  }
}

I get the following result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.8472799,
    "hits": [
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "35",
        "_score": 1.8472799,
        "_source": {
          "name": "новый колодец",
          "id": 35,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "36",
        "_score": 1.8472799,
        "_source": {
          "name": "нового колодца",
          "id": 36,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "37",
        "_score": 1.8472799,
        "_source": {
          "name": "новому колодцу",
          "id": 37,
          "_meta": {}
        }
      },
      {
        "_index": "ingredient",
        "_type": "_doc",
        "_id": "39",
        "_score": 1.8472799,
        "_source": {
          "name": "новом колодце",
          "id": 39,
          "_meta": {}
        }
      }
    ]
  }
}


Response code: 200 (OK); Time: 65ms; Content length: 1235 bytes

I never get "новым колодцем".
My question is: is this a problem with my hunspell dictionary? Should I look for a version of it with more data? Or is it a problem with my settings? Maybe I missed something?

RabBit_BR · June 20, 2022, 2:13am

Hi @GrigoriyKrasovskiy

I would check the tokens that custom_analyzer generates for each input term, for example with the _analyze API. That way, you can be sure it is generating the expected tokens.
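
One way to do that check is the _analyze API, which returns the tokens an analyzer produces for a given text. A minimal sketch, assuming the index is `ingredient` and the analyzer is `custom_analyzer` as in the original post (both sample texts are taken from the question):

```http
GET http://localhost:9200/ingredient/_analyze
Content-Type: application/json

{
  "analyzer": "custom_analyzer",
  "text": ["колодец", "новым колодцем"]
}
```

If the token produced for "колодец" differs from the one produced for "колодцем", the query term cannot match that document. Adding "explain": true to the request body should also list the output of each token filter in the chain, which helps to see whether the hunspell filter or the stemmer introduces the difference.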

Also, I saw this note in the hunspell token filter documentation:

If available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter. In practice, algorithmic stemmers often outperform dictionary stemmers. See dictionary stemmers.
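
To try that recommendation, one option is a throwaway index whose analyzer uses only the algorithmic stemmer, without the hunspell filter. This is a sketch under assumptions: the index name `ingredient_stemmer_test` and the analyzer name `stemmer_only_analyzer` are hypothetical, while the `my_stemmer` filter definition and the `name` field are copied from the question.

```http
PUT http://localhost:9200/ingredient_stemmer_test
Content-Type: application/json

{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "stemmer_only_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "my_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "stemmer_only_analyzer"
      }
    }
  }
}
```

After indexing the same five entries into this index, running the same query_string search against it shows whether the stemmer alone covers all five forms.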

Have you tried using fuzziness to get the fifth document in your results?

{
  "query": {
    "query_string": {
      "query": "колодец~",
      "default_field": "name",
      "fuzziness": 1
    }
  }
}

system · July 18, 2022, 2:13am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
