Keyword case-insensitive search with non-ASCII

Hi

Could you please clarify whether it is possible to do a case-insensitive term search on a keyword field (text.keyword) with non-ASCII characters? It does not seem to work.

Here is the example:
Indexing 4 documents (2 ASCII and 2 non-ASCII):

>>> es.index(index='testi',  document={'text': 'ċ'})   # non-ascii
>>> es.index(index='testi',  document={'text': 'Ċ'})   # non-ascii
>>> es.index(index='testi',  document={'text': 'a'})   # ascii
>>> es.index(index='testi',  document={'text': 'A'})   # ascii

It works well with ASCII characters:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'a',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'QV3ZIYoBpIJf8GfZm98T',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'a'}},
          {'_id': 'Ql3ZIYoBpIJf8GfZpd8n',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'A'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 2}}

but it does not seem to work with non-ASCII:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'ċ',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'P13YIYoBpIJf8GfZbd8D',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'ċ'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 1}}
>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'Ċ',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'QF3ZIYoBpIJf8GfZgt_s',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'Ċ'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 1}}

All settings/mappings are default.

I suspect you may need to define a suitable (possibly custom) normalizer in your mappings.
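As a sketch of that suggestion (untested against a live cluster; the index name `testi_norm` and normalizer name `lowercase_norm` are hypothetical), you could create the index with a custom normalizer built from the built-in `lowercase` token filter and attach it to the keyword sub-field, so values are lowercased with full Unicode semantics at both index and query time:

```python
# Hypothetical index body: a custom normalizer ("lowercase_norm") applied
# to the keyword sub-field, so 'Ċ' is stored and queried as 'ċ'.
settings = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_norm": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "normalizer": "lowercase_norm",
                    }
                },
            }
        }
    },
}
# With a client:  es.indices.create(index="testi_norm", **settings)
# A plain term query on text.keyword for 'ċ' (no case_insensitive flag
# needed) should then match both 'ċ' and 'Ċ'.
```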

>>> es.indices.analyze(text='Ċ', analyzer='default')
ObjectApiResponse({'tokens': [{'token': 'ċ', 'start_offset': 0, 'end_offset': 1, 'type': '<ALPHANUM>', 'position': 0}]})
>>> es.indices.analyze(text='ċ', analyzer='default')
ObjectApiResponse({'tokens': [{'token': 'ċ', 'start_offset': 0, 'end_offset': 1, 'type': '<ALPHANUM>', 'position': 0}]})

Doesn't this mean that the text is analyzed as expected?

A match query seems to work as expected too:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"match":  {'text': 'ċ' }}]}})['hits'])
{'hits': [{'_id': 'P13YIYoBpIJf8GfZbd8D',
           '_index': 'testi',
           '_score': 0.6931471,
           '_source': {'text': 'ċ'}},
          {'_id': 'QF3ZIYoBpIJf8GfZgt_s',
           '_index': 'testi',
           '_score': 0.6931471,
           '_source': {'text': 'Ċ'}}],
 'max_score': 0.6931471,
 'total': {'relation': 'eq', 'value': 2}}

Am I wrong about the analyzers?
Can someone explain how adding a custom normalizer would resolve the issue?

The example you are showing here uses the default analyzer, which seems to produce the expected result with respect to lowercasing.

As far as I know, the case_insensitive option on the term query you initially showed does not use the default analyser, so the behaviour can differ. Based on your initial post it seems to handle lower-casing of ASCII characters only, which is why I suggested creating a separate lowercased multifield in your mappings that uses a char filter, to get the behaviour you are expecting.
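A sketch of that multifield idea (all names here are hypothetical, and strictly speaking the `lowercase` filter alone already folds Ċ → ċ; the mapping char filter is included only because the suggestion mentions one, and it lets you fold characters the filter chain would otherwise miss):

```python
# Hypothetical index body: a "mapping" char filter that folds a specific
# non-ASCII uppercase character, combined with the lowercase filter into a
# custom normalizer, attached to an extra keyword sub-field "text.lower".
body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fold_c_dot": {"type": "mapping", "mappings": ["Ċ => ċ"]}
            },
            "normalizer": {
                "fold_norm": {
                    "type": "custom",
                    "char_filter": ["fold_c_dot"],
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "lower": {"type": "keyword", "normalizer": "fold_norm"}
                },
            }
        }
    },
}
# With a client:  es.indices.create(index="testi_multi", body=body)
# Term queries would then target "text.lower" instead of "text.keyword".
```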

It could be that this is a bug or maybe a known limitation.

The case insensitive setting on the term query only works for ASCII text and is documented as such.
It was introduced mainly as a simple tool to help with searching keyword fields where no attempts to process the content had been made by the administrator (tokenising, stemming, normalising etc).
It’s not intended to be an alternative to normalizers or analyzers, which are the tools required to deal with a variety of languages and content.
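The difference can be illustrated in pure Python, no cluster needed. The helper below is a rough stand-in (my own illustration, not the actual Lucene implementation) for ASCII-only case folding, contrasted with full Unicode lowercasing, which is what a lowercase normalizer would apply:

```python
def ascii_fold(s: str) -> str:
    """Fold only the ASCII letters A-Z to lowercase; leave all other
    characters (including non-ASCII uppercase like 'Ċ') untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

print(ascii_fold("A"))   # 'a' -> ASCII folding matches the lowercase form
print(ascii_fold("Ċ"))   # 'Ċ' -> unchanged, so it can never match 'ċ'
print("Ċ".lower())       # 'ċ' -> full Unicode lowercasing does fold it
```

This matches the behaviour seen in the original post: `a`/`A` match each other under the ASCII-only option, while `ċ`/`Ċ` do not.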

