Accent insensitive search with search analyzer

Valentin · December 22, 2017, 4:49pm

Hi,

I'm having trouble configuring my ES index.
I want to perform a case-insensitive, accent-insensitive search that also handles special characters.

Here is the index creation script :

DELETE index-test
PUT index-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "SearchAnalyzer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "whitespace"
        },
        "IndexAnalyzer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "myType": {
      "properties": {
        "myField": {
          "type": "text",
          "analyzer": "IndexAnalyzer",
          "search_analyzer": "SearchAnalyzer"
        }
      }
    }
  }
}
PUT index-test/myType/1
{
  "myField": "aB-é"
}

Consider this search request :

GET index-test/myType/_search
{
  "query": {
    "bool": {
      "filter": [{
        "wildcard": {
          "myField": {
            "value": "*{charToSearch}*"
          }
        }
      }]
    }
  }
}

It works as expected if you replace {charToSearch} with any of these characters: "a", "A", "b", "B", "-", "e".
Unfortunately, it does not work with "é".
It looks as if the "asciifolding" filter of SearchAnalyzer is not applied.

EDIT: sorry for the poor formatting of my first message and for the delay.

Thanks for any help, Valentin.

dadoonet · December 22, 2017, 5:11pm

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Could you provide a full recreation script as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.

Valentin · January 2, 2018, 8:28am

Hi,

I added a recreation script and formatted the code.

Thanks for any help !

dadoonet · January 2, 2018, 9:37am

Most likely it's because the wildcard query is not analyzed. See https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-wildcard-query.html

Valentin · January 2, 2018, 10:11am

I hadn't seen that.

An alternative could be to perform the "asciifolding" on the client side.
Do you have a better idea for performing this search?
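
A minimal sketch of the client-side folding idea mentioned above, in Python. It assumes that dropping Unicode combining marks after NFKD decomposition is close enough to the "asciifolding" filter for the characters involved; the real filter covers more characters (for example "ø" or "ß"), which this sketch leaves untouched. The helper names `fold_accents` and `wildcard_value` are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def wildcard_value(user_input: str) -> str:
    """Build the "value" for the wildcard query from raw user input.

    The input is folded and lowercased on the client so that it has the
    same shape as the terms produced by IndexAnalyzer.
    Note: '*' and '?' typed by the user are not escaped here.
    """
    return "*" + fold_accents(user_input).lower() + "*"


if __name__ == "__main__":
    print(wildcard_value("\u00e9"))  # "é" typed by the user
```

The result of `wildcard_value` would replace `*{charToSearch}*` in the search request, so the folding no longer depends on the query being analyzed by Elasticsearch.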

val · January 2, 2018, 10:32am

Do not use a wildcard query but include an ngram token filter in IndexAnalyzer instead so that your text is sliced and diced into smaller text chunks all while being ascii-folded. No need to change the SearchAnalyzer though.
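
To make the "sliced and diced" idea concrete, here is a rough Python sketch of what an ngram token filter emits for one token. It is an illustration only, not Elasticsearch's implementation; the function name `ngrams` and the gram sizes 1 to 3 are assumptions chosen for the example. The input "ab-e" is what "aB-é" becomes after the lowercase and asciifolding filters.

```python
from typing import List


def ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    # "ab-e" is the already lowercased and folded form of "aB-é".
    indexed_terms = ngrams("ab-e", min_gram=1, max_gram=3)
    print(indexed_terms)
    # A search for "é" is folded to "e" by SearchAnalyzer and then
    # compared with the indexed grams as an ordinary term.
    print("e" in indexed_terms)
```

Because every short substring is already stored as its own term, an analyzed query (for example a match query) can find "contains" hits without any wildcard.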

Valentin · January 2, 2018, 10:35am

Thanks for the suggestion, I will give it a try.

Valentin · January 2, 2018, 12:49pm

The searchAnalyzer seems to work like I wanted with the ngram token filter.

I need to search for long strings (for example a GUID such as c163e2b5-5362-e556-490a-867a9cd63bc3), which is 36 characters long.

What is the largest "max_gram" value I can use without running into performance problems?
Is there a better solution than nGram for running a "contains" query on long values?
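
As a rough way to size the problem, the Python sketch below counts how many grams a single 36-character token produces for a few "max_gram" values. It assumes "min_gram" is 1 and that the GUID stays one token (the whitespace tokenizer does not split on "-"); the function name `gram_count` is hypothetical. It only counts terms and says nothing about actual index size or query latency.

```python
def gram_count(length: int, min_gram: int, max_gram: int) -> int:
    """Number of n-grams emitted for one token of the given length."""
    largest = min(max_gram, length)
    return sum(length - n + 1 for n in range(min_gram, largest + 1))


if __name__ == "__main__":
    guid = "c163e2b5-5362-e556-490a-867a9cd63bc3"
    for max_gram in (3, 10, 36):
        count = gram_count(len(guid), min_gram=1, max_gram=max_gram)
        print(max_gram, count)
```

With "max_gram" equal to the full length, one GUID alone yields 666 terms. On the other hand, if the search side is not n-grammed, a search string longer than "max_gram" is not equal to any indexed gram, so the two settings have to be chosen together.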

system · January 30, 2018, 12:49pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
