Case insensitive regex query with character range

petrsimon · March 19, 2025, 5:08pm

Hi,
I'm implementing regex search on ES 8.16 (using the Java client) against fields of type keyword, and I came across a strange issue when sending the case_insensitive: true option.

"[abc]+" -> "abc", "ABC"
"[ABC]+" -> "abc", "ABC"
"[a-c]+" -> "abc"
"[A-C]+" -> "ABC"

Is there an explanation for this, or is it a bug in Lucene or ES?

Thanks a lot for any insight!

Mark_Harwood1 · March 19, 2025, 11:14pm

My guess is that case insensitivity is applied only to explicitly listed characters. Because patterns 3 and 4 leave some characters unstated (b and B respectively), they are not interpreted as you'd hope.
This is arguably a bug in Lucene.
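
To make that guess concrete, here is a toy Java model of it. It uses java.util.regex rather than Lucene's automaton code, and the names ExplicitCaseModel and expandListedLetters are made up for illustration. It adds the other-case variant for every explicitly listed ASCII letter inside brackets and leaves range endpoints untouched, which is only an assumption about what happens, not the actual ES or Lucene implementation.

```java
import java.util.regex.Pattern;

// Toy model only: NOT Lucene's or Elasticsearch's code.
final class ExplicitCaseModel {

    // Inside [...], add the other-case variant of each explicitly listed
    // ASCII letter. Letters that are endpoints of a range (next to '-')
    // are copied unchanged. Escapes are not handled in this sketch.
    static String expandListedLetters(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '[') {
                inClass = true;
            } else if (c == ']') {
                inClass = false;
            }
            boolean rangeEndpoint =
                (i + 1 < regex.length() && regex.charAt(i + 1) == '-')
                    || (i > 0 && regex.charAt(i - 1) == '-');
            out.append(c);
            if (inClass && isAsciiLetter(c) && !rangeEndpoint) {
                out.append((char) (c ^ 0x20)); // flip ASCII case
            }
        }
        return out.toString();
    }

    // Whole-string match against the expanded pattern.
    static boolean matches(String regex, String value) {
        return Pattern.matches(expandListedLetters(regex), value);
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Under this model, "[abc]+" and "[ABC]+" accept both "abc" and "ABC", while "[a-c]+" accepts only "abc" and "[A-C]+" only "ABC", which is the same table as in the first post.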

petrsimon · March 20, 2025, 9:03am

I've tried reproducing this in Lucene 10, but there it only returns exactly what's in the regex, i.e. regardless of the ASCII_CASE_INSENSITIVE flag. So apparently the

"[abc]+" -> "abc", "ABC"
"[ABC]+" -> "abc", "ABC"

is implemented in ES.

Christian_Dahlqvist · March 20, 2025, 9:11am

I believe the issue is that ranges in which not all characters are explicitly stated do not work. Did you test the ones quoted above?

petrsimon · March 20, 2025, 9:25am

Yes, I've tested all these patterns. In Lucene with KeywordAnalyzer the results are:

"[abc]+" -> "abc"
"[ABC]+" -> "ABC"
"[a-c]+" -> "abc"
"[A-C]+" -> "ABC"

petrsimon · March 20, 2025, 9:38am

Lucene test: petrsimon/lucene-regex on GitHub

dadoonet · March 20, 2025, 10:15am

I agree, and I think this should at least be documented.

petrsimon · March 20, 2025, 12:33pm

I've reported it as "Case insensitive regex query with character range", Issue #14378 in apache/lucene on GitHub. Thanks all for your feedback.

petrsimon · March 20, 2025, 3:42pm

Seems like it's a known limitation of Lucene. Perhaps it would be nice to document it also in ES?

RainTown · March 20, 2025, 4:47pm

Good spot @petrsimon !!

This is the sort of thing that would infuriate me if I found that the regex [abc]+ isn't treated exactly the same as [a-c]+. It is really subtle and certainly counter-intuitive, and therefore difficult to catch via testing in the real world, creating really subtle user issues.

I note that at

it does not really give any detail on the scope of Lucene regex support - POSIX BRE / POSIX ERE / PCRE / ... It's clearly some subset, maybe documented elsewhere? Sadly the ES documentation at:

has exactly the (unfortunate) example above -

[ … ]

Match one of the characters in the brackets. For example:

[abc] # matches 'a', 'b', 'c'

Inside the brackets, - indicates a range unless - is the first character or escaped. For example:

[a-c] # matches 'a', 'b', or 'c'

The strong implication is that those 2 things are functionally equivalent. But they are clearly not 100% equivalent.

I can see the argument around character classes supporting different languages, where upper/lower case is more complex, so it's all more complicated than it looks at first glance. But still, it's hard to square that with [abc] meaning one thing and [a-c] meaning another.

My suggestion is to add a note to the ES doc above; I note it already documents the lack of anchors. Maybe also add one on pages where the case_insensitive: true/false toggle is documented.
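
Until the docs say so, one possible client-side workaround is to spell out both cases of a range before sending the pattern, so the query no longer depends on case_insensitive for ranges. The sketch below is a minimal, hypothetical helper (RangeCaseExpander and expandAsciiRanges are made-up names, not part of any ES or Lucene API). It assumes simple bracket classes and ASCII letters only, copies backslash escapes through untouched, and does not try to understand other Lucene regex operators.

```java
// Hypothetical helper, not an Elasticsearch or Lucene API.
final class RangeCaseExpander {

    // Rewrites ASCII letter ranges inside [...] so both cases are listed,
    // e.g. "[a-c]+" becomes "[a-cA-C]+". Everything else is copied as-is.
    static String expandAsciiRanges(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Copy an escaped character without interpreting it.
                out.append(c).append(regex.charAt(i + 1));
                i += 2;
                continue;
            }
            if (c == '[') {
                inClass = true;
            } else if (c == ']') {
                inClass = false;
            }
            if (inClass && i + 2 < regex.length() && regex.charAt(i + 1) == '-') {
                char hi = regex.charAt(i + 2);
                boolean lower = isLower(c) && isLower(hi);
                boolean upper = isUpper(c) && isUpper(hi);
                if ((lower || upper) && c <= hi) {
                    // Emit the original range, then the same range in the other case.
                    out.append(c).append('-').append(hi);
                    out.append(swapCase(c)).append('-').append(swapCase(hi));
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean isLower(char c) { return c >= 'a' && c <= 'z'; }

    private static boolean isUpper(char c) { return c >= 'A' && c <= 'Z'; }

    private static char swapCase(char c) { return (char) (c ^ 0x20); }
}
```

With this, [a-c]+ would be sent as [a-cA-C]+ and [^a-c] as [^a-cA-C]. Non-ASCII letters would need proper case mapping, which this sketch deliberately leaves out.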

dadoonet · March 21, 2025, 3:07pm

Thanks all for reporting this and helping on this topic.

@Liam_Thompson created this PR to add some clarification in our documentation.

github.com/elastic/elasticsearch

[DOCS] Clarify regex character range case insensitivity limitations

8.x ← leemthompo-patch-4 · opened 02:54PM - 21 Mar 25 UTC · leemthompo · +9 -0

Flagged by @dadoonet in Slack:

> There's an interesting discussion [on discuss](https://discuss.elastic.co/t/case-insensitive-regex-query-with-character-range/376136/9?u=dadoonet) about [Lucene regex](https://lucene.apache.org/core/10_1_0/core/org/apache/lucene/util/automaton/RegExp.html) support.
> Our [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html#regexp-standard-operators) says things like:
>
> - `[a-c] # matches 'a', 'b', or 'c'`
> - `[^a-c] # matches any character except 'a', 'b', or 'c'`
>
> Which is correct unless you use as well `case_insensitive: true` option. In which case you would expect `B` to match `[a-c]`. But it does not work that way and it's a known limitation as Robert answered in [https://github.com/apache/lucene/issues/14378#issuecomment-2740658343](https://github.com/apache/lucene/issues/14378#issuecomment-2740658343).
