Hi,
I'm implementing regex search on ES 8.16 (using the Java client) against fields of type keyword, and I came across a strange issue when sending the case_insensitive: true option. Each line below shows a pattern and the values it matched:
"[abc]+" -> "abc", "ABC"
"[ABC]+" -> "abc", "ABC"
"[a-c]+" -> "abc"
"[A-C]+" -> "ABC"
Is there an explanation for this, or is it a bug in Lucene or ES?
My guess is that case insensitivity is only applied to explicitly listed characters, and because examples 3 and 4 contain characters that are never written out (b and B respectively), those are not interpreted as you'd hope.
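To make that guess concrete, here is a toy model in plain Java. It is not Lucene code and makes no claim about how Lucene is actually implemented; the class and method names (CaseFoldSketch, expand, matches) are made up for illustration, and it only understands patterns of the form [..]+. The assumption it encodes is the one above: single characters get both cases added, ranges are taken literally.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the guess: case folding for listed characters, none for ranges.
public class CaseFoldSketch {

    // Expands a class body such as "abc" or "a-c" into the accepted characters.
    static Set<Character> expand(String classBody) {
        Set<Character> accepted = new HashSet<>();
        int i = 0;
        while (i < classBody.length()) {
            char c = classBody.charAt(i);
            boolean isRange = i + 2 < classBody.length() && classBody.charAt(i + 1) == '-';
            if (isRange) {
                // Assumption: a range is used as written, with no case folding.
                char end = classBody.charAt(i + 2);
                for (int r = c; r <= end; r++) {
                    accepted.add((char) r);
                }
                i += 3;
            } else {
                // Assumption: an explicitly listed character matches in both cases.
                accepted.add(Character.toLowerCase(c));
                accepted.add(Character.toUpperCase(c));
                i++;
            }
        }
        return accepted;
    }

    // Matches input against a pattern of the form [..]+ only.
    static boolean matches(String pattern, String input) {
        if (!pattern.startsWith("[") || !pattern.endsWith("]+")) {
            throw new IllegalArgumentException("only patterns of the form [..]+ are supported");
        }
        Set<Character> accepted = expand(pattern.substring(1, pattern.length() - 2));
        if (input.isEmpty()) {
            return false;
        }
        for (char c : input.toCharArray()) {
            if (!accepted.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String p : new String[] {"[abc]+", "[ABC]+", "[a-c]+", "[A-C]+"}) {
            System.out.println(p + " abc=" + matches(p, "abc") + " ABC=" + matches(p, "ABC"));
        }
    }
}
```

Under that assumption the model gives the same four results as in the original post, which is consistent with the guess but of course doesn't prove it.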
This is arguably a bug in Lucene.
I've tried reproducing in Lucene 10, but there it only returns exactly what's in the regex, i.e. regardless of the ASCII_CASE_INSENSITIVE flag. So apparently the
This is the sort of thing that would infuriate me: finding that the regex [abc]+ isn't treated exactly the same as [a-c]+. It's counter-intuitive and therefore difficult to catch via testing in the real world, which leads to really subtle user issues.
I note that at
it does not really give any detail on the scope of Lucene regex support - POSIX BRE / POSIX ERE / PCRE / ... It's clearly some subset, maybe documented elsewhere? Sadly the ES documentation at:
has exactly the (unfortunate) example above -
[ … ]
Match one of the characters in the brackets. For example:
[abc] # matches 'a', 'b', 'c'
Inside the brackets, - indicates a range unless - is the first character or escaped. For example:
[a-c] # matches 'a', 'b', or 'c'
The strong implication is that those two forms are functionally equivalent. But they are clearly not 100% equivalent.
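For comparison, this is the equivalence I'd expect, shown with java.util.regex. That is a different engine from Lucene's, so this only illustrates the expectation and says nothing about what Lucene should do; the class name RangeEquivalenceCheck is just for the example.

```java
import java.util.regex.Pattern;

// Comparison only: java.util.regex, not Lucene's regex engine.
public class RangeEquivalenceCheck {

    // Full-string match with case-insensitive matching enabled.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"[abc]+", "[ABC]+", "[a-c]+", "[A-C]+"}) {
            System.out.println(regex + " abc=" + matches(regex, "abc") + " ABC=" + matches(regex, "ABC"));
        }
    }
}
```

Here the listed form and the range form behave the same with the flag on: all four patterns match both "abc" and "ABC".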
I can see the argument around character classes supporting different languages, where upper/lower case is more complex, so it's all more complicated than it looks at first glance. But it's still hard to square that with [abc] meaning one thing and [a-c] meaning another.
My suggestion would be to add a note to the ES doc above; I note it already documents the lack of anchors. Maybe also add one on the pages where the case_insensitive: true/false toggle is documented.