Keyword case-insensitive search with non-ASCII

Hi

Could you please clarify whether it is possible to do a case-insensitive term search on a keyword field (text.keyword) with non-ASCII characters? It does not seem to work.

Here is the example:
Indexing 4 documents (2 ASCII and 2 non-ASCII):

>>> es.index(index='testi',  document={'text': 'ċ'})   # non-ascii
>>> es.index(index='testi',  document={'text': 'Ċ'})   # non-ascii
>>> es.index(index='testi',  document={'text': 'a'})   # ascii
>>> es.index(index='testi',  document={'text': 'A'})   # ascii

It works well with ASCII characters:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'a',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'QV3ZIYoBpIJf8GfZm98T',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'a'}},
          {'_id': 'Ql3ZIYoBpIJf8GfZpd8n',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'A'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 2}}

but it does not seem to work with non-ASCII:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'ċ',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'P13YIYoBpIJf8GfZbd8D',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'ċ'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 1}}
>>> pprint(es.search(index='testi', query={"bool":{"must":[{"term":  {'text.keyword': {'value': 'Ċ',  'case_insensitive': 'true'  }  }}]}})['hits'])
{'hits': [{'_id': 'QF3ZIYoBpIJf8GfZgt_s',
           '_index': 'testi',
           '_score': 1.0,
           '_source': {'text': 'Ċ'}}],
 'max_score': 1.0,
 'total': {'relation': 'eq', 'value': 1}}

All settings/mappings are default.

I suspect you may need to define a suitable (possibly custom) normalizer in your mappings.
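As a sketch of that suggestion (untested against a live cluster; the index name `testi_norm` and normalizer name `lowercase_norm` are hypothetical), you could create the index with a custom normalizer built from the built-in `lowercase` token filter and attach it to the keyword sub-field, so values are lowercased with full Unicode semantics at both index and query time:

```python
# Hypothetical index body: a custom normalizer ("lowercase_norm") applied
# to the keyword sub-field, so 'Ċ' is stored and queried as 'ċ'.
settings = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_norm": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "normalizer": "lowercase_norm",
                    }
                },
            }
        }
    },
}
# With a client:  es.indices.create(index="testi_norm", **settings)
# A plain term query on text.keyword for 'ċ' (no case_insensitive flag
# needed) should then match both 'ċ' and 'Ċ'.
```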

>>> es.indices.analyze(text='Ċ', analyzer='default')
ObjectApiResponse({'tokens': [{'token': 'ċ', 'start_offset': 0, 'end_offset': 1, 'type': '<ALPHANUM>', 'position': 0}]})
>>> es.indices.analyze(text='ċ', analyzer='default')
ObjectApiResponse({'tokens': [{'token': 'ċ', 'start_offset': 0, 'end_offset': 1, 'type': '<ALPHANUM>', 'position': 0}]})

Doesn't this mean that the text is analyzed as expected?

A match query seems to work as expected too:

>>> pprint(es.search(index='testi', query={"bool":{"must":[{"match":  {'text': 'ċ' }}]}})['hits'])
{'hits': [{'_id': 'P13YIYoBpIJf8GfZbd8D',
           '_index': 'testi',
           '_score': 0.6931471,
           '_source': {'text': 'ċ'}},
          {'_id': 'QF3ZIYoBpIJf8GfZgt_s',
           '_index': 'testi',
           '_score': 0.6931471,
           '_source': {'text': 'Ċ'}}],
 'max_score': 0.6931471,
 'total': {'relation': 'eq', 'value': 2}}

Am I wrong about the analyzers?
Can someone explain how adding a custom normalizer would resolve the issue?

The example you are showing here uses the default analyzer, which seems to produce the expected result with respect to lowercasing.

As far as I know, the case_insensitive option on the term query you initially showed does not use the default analyser, so the behaviour can differ. Based on your initial post it seems to handle lower-casing of ASCII characters only, which is why I suggested creating a separate lowercased multifield in your mappings that uses a char filter, to get the behaviour you are expecting.
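A sketch of that multifield idea (all names here are hypothetical, and strictly speaking the `lowercase` filter alone already folds Ċ → ċ; the mapping char filter is included only because the suggestion mentions one, and it lets you fold characters the filter chain would otherwise miss):

```python
# Hypothetical index body: a "mapping" char filter that folds a specific
# non-ASCII uppercase character, combined with the lowercase filter into a
# custom normalizer, attached to an extra keyword sub-field "text.lower".
body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fold_c_dot": {"type": "mapping", "mappings": ["Ċ => ċ"]}
            },
            "normalizer": {
                "fold_norm": {
                    "type": "custom",
                    "char_filter": ["fold_c_dot"],
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "lower": {"type": "keyword", "normalizer": "fold_norm"}
                },
            }
        }
    },
}
# With a client:  es.indices.create(index="testi_multi", body=body)
# Term queries would then target "text.lower" instead of "text.keyword".
```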

It could be that this is a bug or maybe a known limitation.

The case insensitive setting on the term query only works for ASCII text and is documented as such.
It was introduced mainly as a simple tool to help with searching keyword fields where no attempts to process the content had been made by the administrator (tokenising, stemming, normalising etc).
It’s not intended to be an alternative to normalizers or analyzers, which are the tools required to deal with a variety of languages and content.
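The difference can be illustrated in pure Python, no cluster needed. The helper below is a rough stand-in (my own illustration, not the actual Lucene implementation) for ASCII-only case folding, contrasted with full Unicode lowercasing, which is what a lowercase normalizer would apply:

```python
def ascii_fold(s: str) -> str:
    """Fold only the ASCII letters A-Z to lowercase; leave all other
    characters (including non-ASCII uppercase like 'Ċ') untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

print(ascii_fold("A"))   # 'a' -> ASCII folding matches the lowercase form
print(ascii_fold("Ċ"))   # 'Ċ' -> unchanged, so it can never match 'ċ'
print("Ċ".lower())       # 'ċ' -> full Unicode lowercasing does fold it
```

This matches the behaviour seen in the original post: `a`/`A` match each other under the ASCII-only option, while `ċ`/`Ċ` do not.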

