Stopwords do not work in Ukrainian

I have the following problem. There is a product: Папір офісний Double A, A5 (148 х 210 мм), Premium 80г/м2 500 аркушів

When I search for "Папір офісний 500 аркушів", everything is fine: the product is found.

But when I search for "Папір офісний на 500 аркушів", the product is not found.

What could be the problem? I tried adding "на" to the stopwords, but it made no difference.

I need words such as "on", "in" and "and" to have no effect on the search; they should be ignored in the search query altogether.

Image: elasticsearch:7.9.1

Here is the code to build the index:

public function createIndex()
    {
        $mappingParams = [
            'index' => $this->getIndexName(),
            'body' => [
                'mappings' => [
                    '_source' => [
                        'enabled' => true
                    ],
                    'properties' => $this->getMappingProperties(),
                ],
                'settings' => [
                    'analysis' => [
                        'normalizer' => [
                            'lowercase_keyword' => [
                                'type' => 'custom',
                                'filter' => ['lowercase', 'trim'],
                            ],
                        ],
                        'tokenizer' => [
                            'ngram_tokenizer' => [
                                'type' => 'edge_ngram',
                                'min_gram' => 1,
                                'max_gram' => 15,
                                'token_chars' => [
                                    'letter',
                                    'digit',
                                ],
                            ],
                        ],
                        'filter' => [
                            'synonym_filter' => [
                                'type' => 'synonym',
                                'synonyms' => [
                                    'аркуш, сторінка', 'арк, ст', 'аркуш, ст', 'сторінка, арк', 'автомобіль, машина',
                                    'тетрадь, зошит', 'кошелек, гаманець', 'на' => ' ',
                                ],
                            ],
                            'uk_stopwords' => [
                                'type' => 'stop',
                                'stopwords' => ['на', 'та', 'і', 'at', 'TY358'],
                            ],
                        ],
                        'analyzer' => [
                            'ngram_analyzer' => [
                                'type' => 'custom',
                                'tokenizer' => 'ngram_tokenizer',
                                'filter' => [
                                    'lowercase',
                                    'trim',
                                    'synonym_filter',
                                    'uk_stopwords',
                                    'stop',
                                ],
                            ],
                        ],
                    ],
                    'index' => [
                        'max_result_window' => intval(\Variable::getArray('settings.elasticsearch.index_size', 100000)),
                    ],
                ],
            ],
        ];

        return $this->getClient()?->indices()->create($mappingParams);
    }

Here is the mapping for the index:

protected array $mappingProperties = [
        'id' => [
            'type' => 'keyword',
        ],
        'name' => [
            'type' => 'text',
            'fielddata' => true,
            'analyzer' => 'ngram_analyzer',
            'search_analyzer' => 'standard',
            'fields' => [
                'keyword' => [
                    'type' => 'keyword',
                    'normalizer' => 'lowercase_keyword',
                ]
            ],
        ],
        'name_ru' => [
            'type' => 'text',
            'fielddata' => true,
            'analyzer' => 'ngram_analyzer',
            'search_analyzer' => 'standard',
            'fields' => [
                'keyword' => [
                    'type' => 'keyword',
                    'normalizer' => 'lowercase_keyword',
                ]
            ],
        ],
        'body' => [
            'type' => 'text',
        ],
        'body_ru' => [
            'type' => 'text',
        ],
        'price' => [
            'type' => 'object',
        ],
        'extern_id' => [
            'type' => 'keyword',
            'fields' => [
                'long' => [
                    'type' => 'long',
                ]
            ],
        ],
        'gtin' => [
            'type' => 'keyword',
        ],
        'artikul' => [
            'type' => 'text',
            'fielddata' => true,
            'fields' => [
                'keyword' => [
                    'type' => 'keyword',
                    'normalizer' => 'lowercase_keyword',
                ]
            ],
        ],
        'gpc' => [
            'type' => 'integer',
        ],
        'rating' => [
            'type' => 'float',
        ],
        'status' => [
            'type' => 'keyword',
        ],
        'availability' => [
            'type' => 'integer',
        ],
        'created_at' => [
            'type' => 'date',
        ],
        'category_id' => [
            'type' => 'keyword',
        ],
        'categories_ids' => [
            'type' => 'keyword',
        ],
        'brand_id' => [
            'type' => 'keyword',
        ],
        'properties' => [
            'type' => 'object',
        ],
        'is_feed' => [
            'type' => 'boolean',
        ],
        'is_prior' => [
            'type' => 'boolean',
        ],
        'is_new' => [
            'type' => 'boolean',
        ],
        'is_action' => [
            'type' => 'boolean',
        ],
        'is_popular' => [
            'type' => 'boolean',
        ],
        'is_showonmain' => [
            'type' => 'boolean',
        ],
        'is_freedelivery' => [
            'type' => 'boolean',
        ],
        'has_in_gurt' => [
            'type' => 'boolean',
        ],
        'has_in_fop' => [
            'type' => 'boolean',
        ],
    ];

This is how a search query is built by name (text field):

$scoreSort = false;
        $query = [];

        if ($value = Arr::get($params, 'q')) {
            $termQuery = [
                'query' => [
                    'term' => [
                        'extern_id' => [
                            'value' => $value
                        ]
                    ]
                ]
            ];

            $termResponse = $this->searchOnElasticsearch($termQuery);

            $scoreSort = true;
            if ($termResponse['hits']['total']['value'] > 0) {
                $query['query']['bool']['must'][] = [
                    'term' => [
                        'extern_id' => substr($value, 0, 100)
                    ]
                ];
            } else {
                if (($locale = app()->getLocale()) === 'uk') {
                    $nameField = 'name';
                    $fields = ['extern_id^15', 'gtin^10', 'artikul^10', "{$nameField}^5"];
                } else {
                    $nameField = "name_{$locale}";
                    $fields = ['extern_id^15', 'gtin^10', 'artikul^10', "{$nameField}^5"];
                }

                $query['query']['bool']['must'][] = [
                    'bool' => [
                        'should' => [
                            [
                                'term' => [
                                    "{$nameField}.keyword" => $value
                                ]
                            ],
                            [
                                'multi_match' => [
                                    'fields' => $fields,
                                    'query' => substr($value, 0, 100),
                                    'fuzziness' => $this->getFuzziness(substr($value, 0, 100)),
                                    'prefix_length' => 3, 
                                    'operator' => 'AND',
                                    'analyzer' => 'ngram_analyzer',
                                ],
                            ]
                        ]
                    ]
                ];
            }
        }

Hi @Ivan_Kachula

When you run your analyzer with the stop filter, you will notice that the token "н" is still created. This is related to the letter token char; I have not investigated the reason, but you can start from there.

Run this query:

GET /_analyze
{
  "text": [
    "Папір офісний на 500 аркушів-"
  ],
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 15,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "filter": [
    {
      "type": "stop",
      "stopwords": [
        "на",
        "та",
        "і",
        "at",
        "TY358"
      ]
    }
  ]
}
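
The effect can also be reproduced outside Elasticsearch. The following is a rough PHP sketch, not Elasticsearch's actual implementation: the helper names edgeNgramTokens and stopFilter are made up for illustration, and it assumes the mbstring extension is available. It imitates an edge n-gram tokenizer with min_gram 1 followed by a stop filter, and shows that the stop filter drops the whole word "на" but leaves its one-letter prefix "н" behind:

```php
<?php
// Rough imitation of an edge_ngram tokenizer with token_chars = letter, digit.
// Hypothetical helper, written only to illustrate the behaviour.
function edgeNgramTokens(string $text, int $minGram = 1, int $maxGram = 15): array
{
    // Split on anything that is not a Unicode letter or decimal digit.
    $words = preg_split('/[^\p{L}\p{Nd}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = [];
    foreach ($words as $word) {
        $limit = min($maxGram, mb_strlen($word, 'UTF-8'));
        // Emit every prefix of the word from $minGram up to $limit characters.
        for ($len = $minGram; $len <= $limit; $len++) {
            $tokens[] = mb_substr($word, 0, $len, 'UTF-8');
        }
    }
    return $tokens;
}

// Imitation of a stop filter: removes only tokens that equal a stopword exactly.
function stopFilter(array $tokens, array $stopwords): array
{
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => !in_array($token, $stopwords, true)
    ));
}

$tokens = edgeNgramTokens('на 500');               // н, на, 5, 50, 500
$filtered = stopFilter($tokens, ['на', 'та', 'і']);
echo implode(' ', $filtered), PHP_EOL;             // prints: н 5 50 500
```

Since the search query in the question also passes 'analyzer' => 'ngram_analyzer' together with 'operator' => 'AND', the leftover "н" presumably becomes a required term, and the product name contains no word that starts with "н".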

To verify this, add the character "н" to the stopwords list; both search queries will then work.
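
A more general alternative, offered only as an untested sketch: remove the stopwords before the n-grams are generated, by tokenizing on whole words and applying edge_ngram as a token filter placed after the stop filter. The filter name edge_ngram_filter is hypothetical, the stopword list is shortened for illustration, and the array is meant as a possible replacement for part of the 'analysis' settings in createIndex(); the synonym filter and the normalizer from the question are left out for brevity:

```php
<?php
// Hypothetical variant of the 'analysis' settings: stopwords are removed
// while tokens are still whole words, and only then turned into edge n-grams.
$analysis = [
    'filter' => [
        'uk_stopwords' => [
            'type' => 'stop',
            'stopwords' => ['на', 'та', 'і'],
        ],
        // Assumed name; 'edge_ngram' here is the token filter, not the tokenizer.
        'edge_ngram_filter' => [
            'type' => 'edge_ngram',
            'min_gram' => 1,
            'max_gram' => 15,
        ],
    ],
    'analyzer' => [
        'ngram_analyzer' => [
            'type' => 'custom',
            'tokenizer' => 'standard',
            // Order matters: lowercase, drop stopwords, then build prefixes.
            'filter' => ['lowercase', 'uk_stopwords', 'edge_ngram_filter'],
        ],
    ],
];
```

With this ordering the word "на" never reaches the n-gram step, so no stray "н" token should appear at index time or at search time. The standard tokenizer may split some product names differently from the original tokenizer, and the index would have to be recreated and the documents reindexed for the new analyzer to take effect.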
