Accent insensitive search with search analyzer


#1

Hi,

I am having trouble configuring my ES index.
I want to perform a search that is case-insensitive and accent-insensitive, and that also handles special characters.

Here is the index creation script:

DELETE index-test
PUT index-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "SearchAnalyzer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "whitespace"
        },
        "IndexAnalyzer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "myType": {
      "properties": {
        "myField": {
          "type": "text",
          "analyzer": "IndexAnalyzer",
          "search_analyzer": "SearchAnalyzer"
        }
      }
    }
  }
}
PUT index-test/myType/1
{
  "myField": "aB-é"
}

Consider this search request:

GET index-test/myType/_search
{
  "query": {
    "bool": {
      "filter": [{
        "wildcard": {
          "myField": {
            "value": "*{charToSearch}*"
          }
        }
      }]
    }
  }
}

It works as expected if you replace {charToSearch} with any of these characters: "a", "A", "b", "B", "-", "e".
Unfortunately, it does not work with "é".
It looks as if the "asciifolding" filter of SearchAnalyzer is not applied.

EDIT: sorry for the poor formatting of my first message and for the delay.

Thanks for any help, Valentin.


(David Pilato) #2

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Could you provide a full recreation script as described in

It will help to better understand what you are doing.
Please try to keep the example as simple as possible.


#3

Hi,

I added a recreation script and formatted the code.

Thanks for any help!


(David Pilato) #4

Most likely it's because the wildcard query is not analyzed. See https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-wildcard-query.html
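
To make this concrete, here is a rough Python sketch (not Elasticsearch code). It approximates the index-time lowercase and asciifolding filters with Unicode decomposition, and it treats the wildcard pattern as a raw string that is compared to the indexed terms without any analysis. The helper names `fold` and `wildcard_matches` are made up for the illustration, and the sketch only models the accented character; it does not try to reproduce how Elasticsearch handles case in the pattern.

```python
import fnmatch
import unicodedata


def fold(text):
    """Approximate lowercase + asciifolding (assumption: NFKD decomposition
    with combining marks dropped is close enough for this example)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def wildcard_matches(pattern, indexed_terms):
    """Compare the pattern to the indexed terms as-is, with no analysis."""
    return [t for t in indexed_terms if fnmatch.fnmatchcase(t, pattern)]


# The whitespace tokenizer keeps "aB-é" as a single token, which is then folded.
indexed_terms = [fold(tok) for tok in "aB-é".split()]

print(indexed_terms)                            # the folded term stored in the index
print(wildcard_matches("*e*", indexed_terms))   # finds the folded term
print(wildcard_matches("*é*", indexed_terms))   # empty: no indexed term contains "é"
```

In this model the index only ever contains the folded term, so a pattern that still carries the accent has nothing to match.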


#5

I hadn't seen that.

An alternative could be to perform the "asciifolding" on the client side.
Do you have a better idea for performing this search?
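
For reference, client-side folding could look roughly like the Python sketch below. It only builds the request body from the first post; sending it is left out. The function names `fold_for_search` and `build_wildcard_body` are hypothetical, and Unicode decomposition is only an approximation of the asciifolding filter: characters such as "ø", "ß" or "æ" have no decomposition, so they would pass through unchanged here.

```python
import unicodedata


def fold_for_search(term):
    """Lowercase the term and drop combining accents
    (a rough stand-in for lowercase + asciifolding, not the real filter)."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_wildcard_body(term, field="myField"):
    """Build the wildcard request body from the first post with a folded term."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"wildcard": {field: {"value": "*" + fold_for_search(term) + "*"}}}
                ]
            }
        }
    }


body = build_wildcard_body("é")
print(body["query"]["bool"]["filter"][0]["wildcard"]["myField"]["value"])
```

Note that the sketch does not escape `*` or `?` if they appear in the user's term.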


(Val Crettaz) #6

Do not use a wildcard query. Instead, include an ngram token filter in IndexAnalyzer so that your text is sliced and diced into smaller chunks, all while being ASCII-folded. There is no need to change SearchAnalyzer, though.
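
The idea can be sketched in plain Python (again a simulation, not Elasticsearch code). At index time each folded token is expanded into n-grams; at search time the query text is only folded and then looked up as a single term. The `min_gram=1` and `max_gram=4` values and the helper names are assumptions chosen for the example.

```python
import unicodedata


def fold(text):
    """Approximate lowercase + asciifolding via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(token, min_gram, max_gram):
    """All substrings of token whose length is between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


# Index time: whitespace tokens -> lowercase/fold -> n-grams.
indexed = set()
for tok in "aB-é".split():
    indexed |= ngrams(fold(tok), min_gram=1, max_gram=4)

# Search time: the query is folded too, but not split into n-grams.
for query in ["é", "B-É", "x"]:
    print(query, fold(query) in indexed)
```

In this model a "contains" search becomes an exact term lookup, so a query longer than `max_gram` would produce a term that was never indexed.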


#7

Thanks for the suggestion, I will give it a try.


#8

SearchAnalyzer seems to work the way I wanted with the ngram token filter.

I need to search for long strings, for example a GUID such as c163e2b5-5362-e556-490a-867a9cd63bc3, which is 36 characters long.

  • What is the maximum "max_gram" value I should not exceed to avoid performance problems?
  • Is there a better solution than nGram for "contains" queries on long values?
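
To get a feeling for the cost, the small Python calculation below counts how many n-grams one 36-character token generates for a few `max_gram` values, assuming `min_gram` is 1 and that the whitespace tokenizer keeps the GUID as a single token. It is only arithmetic on the number of generated terms (repeated grams are counted each time); it says nothing about where Elasticsearch performance actually starts to degrade. The helper name `count_ngrams` is made up.

```python
def count_ngrams(token_length, min_gram, max_gram):
    """Number of n-gram positions produced for one token of the given length."""
    return sum(
        token_length - n + 1
        for n in range(min_gram, min(max_gram, token_length) + 1)
    )


guid = "c163e2b5-5362-e556-490a-867a9cd63bc3"

for max_gram in (3, 10, 36):
    print(max_gram, count_ngrams(len(guid), 1, max_gram))
```

With `max_gram` equal to the full length, the count is n(n+1)/2 for a token of n characters, so it grows quadratically with the token length.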
