Fuzziness in query_string

Hi all,

I am using query_string for my queries, because I want the user to be able to use operators like AND and OR.

The problem is that if a user searches for "nestle", he won't get any results, because the term is actually written "nestlé". So the search term has to be exactly "nestlé", not "nestle".

So I thought I might add some fuzziness. Would you agree, or do you have a better solution for handling accents?

The problem with fuzziness is that in query_string the user must add the tilde (~) himself after a word, but he doesn't know about that.

So I am looking for a way to automatically add a ~ after each token. Is this possible?
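
Something like this naive client-side rewrite is what I have in mind (just a sketch; it ignores quoted phrases and other query_string syntax):

// Naive sketch: append ~ to every term the user typed,
// leaving the boolean operators untouched. Purely illustrative.
function add_fuzziness( $query ) {
	$tokens = preg_split( '/\s+/', trim( $query ) );
	foreach ( $tokens as &$token ) {
		if ( ! in_array( $token, array( 'AND', 'OR', 'NOT' ), true ) ) {
			$token .= '~';
		}
	}
	unset( $token );
	return implode( ' ', $tokens );
}

// add_fuzziness( 'nestle AND milk' ) => 'nestle~ AND milk~'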

Best regards,
Roger

Hi,
If your problem is just the "nestlé" use case, I don't think you need fuzziness.

What you could do is add the ASCII folding token filter to your mapping, with the preserve_original parameter set to true.
In that case:
"nestlé" will be indexed as both "nestlé" and "nestle".

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html#analysis-asciifolding-tokenfilter
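
For example, the analysis settings could look roughly like this (the names 'folding_analyzer' and 'my_ascii_folding' are just placeholders):

// Sketch of analysis settings with an asciifolding filter that
// keeps both the folded and the original token.
'analysis' => array(
	'analyzer' => array(
		'folding_analyzer' => array(
			'type'      => 'custom',
			'tokenizer' => 'standard',
			'filter'    => array( 'lowercase', 'my_ascii_folding' ),
		),
	),
	'filter' => array(
		'my_ascii_folding' => array(
			'type'              => 'asciifolding',
			'preserve_original' => true, // index "nestlé" and "nestle"
		),
	),
),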

Thank you @klof for your answer.

The thing is, I am using the ElasticPress WordPress plugin, which presets the settings/mappings when I start the indexing process. I tried to "hack" the plugin, but searching for "nestle" still does not work.

I'm adding part of the mappings file from the plugin so you can see what I have added
(see 'my_ascii_folding' in analyzer => default => filter, and further down in "filter", where I defined my custom asciifolding filter).

Have I done something wrong?

'settings' => array(
	'index.mapping.total_fields.limit' => apply_filters( 'ep_total_field_limit', 5000 ),
	'index.max_result_window' => apply_filters( 'ep_max_result_window', 1000000 ),
	'analysis' => array(
		'analyzer' => array(
			'default' => array(
				'tokenizer' => 'standard',
				'filter' => array( 'standard', 'ewp_word_delimiter', 'lowercase', 'stop', 'ewp_snowball', 'my_ascii_folding' ),
				'char_filter' => array( 'html_strip' ),
				'language' => apply_filters( 'ep_analyzer_language', 'english', 'analyzer_default' ),
			),
			'shingle_analyzer' => array(
				'type' => 'custom',
				'tokenizer' => 'standard',
				'filter' => array( 'lowercase', 'shingle_filter' ),
			),
			'ewp_lowercase' => array(
				'type' => 'custom',
				'tokenizer' => 'keyword',
				'filter' => array( 'lowercase' ),
			),
		),
		'filter' => array(
			'my_ascii_folding' => array(
				'type' => 'asciifolding',
				'preserve_original' => true,
			),
			'shingle_filter' => array(
				'type' => 'shingle',
				'min_shingle_size' => 2,
				'max_shingle_size' => 5,
			),
			'ewp_word_delimiter' => array(
				'type' => 'word_delimiter',
				'preserve_original' => true,
			),
			'ewp_snowball' => array(
				'type' => 'snowball',
				'language' => apply_filters( 'ep_analyzer_language', 'english', 'filter_ewp_snowball' ),
			),
			'edge_ngram' => array(
				'side' => 'front',
				'max_gram' => 10,
				'min_gram' => 3,
				'type' => 'edgeNGram',
			),
		),
		'normalizer' => array(
			'lowerasciinormalizer' => array(
				'type'   => 'custom',
				'filter' => array( 'lowercase', 'asciifolding' ),
			),
		),
	),
), 
[ ... ]

I solved it by removing the Snowball filter. After re-indexing, it works now: "nestle" finds articles containing "nestlé".
Don't know if this is the best solution though :)
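
If I had to guess, the reason is filter order: in the default analyzer, ewp_snowball ran before my_ascii_folding, so the plain query term and the folded index token could end up stemmed into different forms. An untested alternative might be to keep the stemmer but move the folding filter in front of it:

// Untested: fold accents before stemming, so "nestlé" and "nestle"
// reach the snowball filter as the same token.
'filter' => array( 'standard', 'ewp_word_delimiter', 'lowercase', 'stop', 'my_ascii_folding', 'ewp_snowball' ),

Either way, a re-index is needed, since the analyzer only applies to documents at index time.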
