Case insensitive regex query with character range

Hi,
I'm implementing regex search on ES 8.16 (using the Java client) against fields of type keyword, and I came across a strange issue when sending the case_insensitive: true option.

  1. "[abc]+" -> "abc", "ABC"
  2. "[ABC]+" -> "abc", "ABC"
  3. "[a-c]+" -> "abc"
  4. "[A-C]+" -> "ABC"
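For reference, the request body has roughly this shape. This is only a sketch that builds the JSON by hand rather than through the Java client; the field name my_field and the helper regexpQuery are made up for illustration, while regexp, value and case_insensitive are the actual query parameters.

```java
final class RegexpQueryJson {

    // Escapes backslashes and double quotes so the value is a valid JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the body of a regexp query on a single field.
    // "field" is whatever keyword field is being searched (hypothetical here).
    static String regexpQuery(String field, String pattern, boolean caseInsensitive) {
        return "{\"query\":{\"regexp\":{\"" + escape(field) + "\":{"
                + "\"value\":\"" + escape(pattern) + "\","
                + "\"case_insensitive\":" + caseInsensitive
                + "}}}}";
    }

    public static void main(String[] args) {
        // The four patterns from the list above, all sent with case_insensitive: true.
        for (String p : new String[] {"[abc]+", "[ABC]+", "[a-c]+", "[A-C]+"}) {
            System.out.println(regexpQuery("my_field", p, true));
        }
    }
}
```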

Is there an explanation for this or is it a bug in lucene or ES?

Thanks a lot for any insight!

My guess is that case insensitivity only applies to explicitly listed characters, and because patterns 3 and 4 have some unstated characters (b and B respectively), these are not interpreted as you'd hope.
This is arguably a bug in Lucene.
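If that guess is right, one possible client-side workaround is to rewrite letter ranges into explicit character lists before sending the pattern, so that [a-c]+ goes out as [abc]+. The helper below is a minimal sketch under that assumption, not anything ES or Lucene provides: the class and method names are invented, it only handles plain ASCII letter ranges inside [...], and it treats backslash as the only escape, so quoted strings and other Lucene regex syntax are passed through untouched.

```java
final class RangeExpander {

    private static boolean isLower(char c) { return c >= 'a' && c <= 'z'; }
    private static boolean isUpper(char c) { return c >= 'A' && c <= 'Z'; }

    // Rewrites ranges such as a-c or A-C inside character classes into
    // explicit lists (abc, ABC). Everything else is copied unchanged.
    static String expandRanges(String pattern) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            // Copy an escaped character verbatim, whatever it is.
            if (c == '\\' && i + 1 < pattern.length()) {
                out.append(c).append(pattern.charAt(i + 1));
                i += 2;
                continue;
            }
            if (!inClass) {
                if (c == '[') inClass = true;
                out.append(c);
                i++;
                continue;
            }
            if (c == ']') {
                inClass = false;
                out.append(c);
                i++;
                continue;
            }
            // Inside a class: expand lo-hi when both ends are ASCII letters of the same case.
            if (i + 2 < pattern.length() && pattern.charAt(i + 1) == '-') {
                char hi = pattern.charAt(i + 2);
                boolean sameCase = (isLower(c) && isLower(hi)) || (isUpper(c) && isUpper(hi));
                if (sameCase && c <= hi) {
                    for (char x = c; x <= hi; x++) out.append(x);
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandRanges("[a-c]+"));
        System.out.println(expandRanges("[A-C]+"));
    }
}
```

Whether the expanded pattern then matches both cases still depends on explicit lists being handled the way patterns 1 and 2 were in the original post.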


I've tried reproducing this in Lucene 10, but there it only returns exactly what's in the regex, regardless of the ASCII_CASE_INSENSITIVE flag. So apparently this behavior is implemented in ES:

  1. "[abc]+" -> "abc", "ABC"
  2. "[ABC]+" -> "abc", "ABC"

I believe the issue is that ranges in which not every character is stated explicitly are not working. Did you test the ones quoted above?

Yes, I've tested all these patterns. In Lucene with KeywordAnalyzer, the results are:

  1. "[abc]+" -> "abc"
  2. "[ABC]+" -> "ABC"
  3. "[a-c]+" -> "abc"
  4. "[A-C]+" -> "ABC"

Lucene test: petrsimon/lucene-regex on GitHub

I agree, and I think this should at least be documented.

I've reported it as apache/lucene issue #14378, "Case insensitive regex query with character range". Thanks all for your feedback.

Seems like it's a known limitation of Lucene. Perhaps it would be nice to document it also in ES?


Good spot @petrsimon !!

This is the sort of thing that would infuriate me: finding that the regex [abc]+ isn't treated exactly the same as [a-c]+. It's really subtle and certainly counter-intuitive, and therefore difficult to catch via testing in the real world, which creates hard-to-diagnose user issues.

I note that at

it does not really give any detail on the scope of Lucene regex support - POSIX BRE / POSIX ERE / PCRE / ... It's clearly some subset, maybe documented elsewhere? Sadly the ES documentation at:

has exactly the (unfortunate) example above -

[ … ]

Match one of the characters in the brackets. For example:

[abc] # matches 'a', 'b', 'c'

Inside the brackets, - indicates a range unless - is the first character or escaped. For example:

[a-c] # matches 'a', 'b', or 'c'

The strong implication is that those 2 things are functionally equivalent. But they are clearly not 100% equivalent.

I can see the argument around character classes supporting different languages, where upper/lower case is more complex, so it's all more complicated than it looks at first glance. But it's still hard to square that with [abc] meaning one thing and [a-c] meaning another.

My suggestion is to add a note to the ES doc above, which already documents the lack of anchors, and maybe also to the pages where the case_insensitive: true/false toggle is documented.


Thanks all for reporting this and helping on this topic.

@Liam_Thompson created this PR to add some clarification in our documentation.
