From Elasticsearch Regex to Java Pattern

Hi,

Does anyone know if there's an easy way to go from ES's flavor of Regex to Java's?

For example, if I understand the documentation correctly:

  1. The ES pattern /"some\doslike\path"/ would match some\doslike\path, because everything between double quotes is taken literally. With Java, however, the same pattern would (a) look for the quotes themselves and (b) try to interpret \d and \p as predefined character classes.

  2. ES has a numeric range pattern: /<001-100>/ matches 002 but not 2 or 02. The Java pattern /<001-100>/ doesn't match any numbers, because Java has no numeric range syntax.
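To make the mismatch concrete, here is a minimal sketch that feeds both ES patterns to java.util.regex unchanged. The class name EsVsJavaRegex and the helper javaAccepts are made up for illustration; only the JDK's own Pattern API is used.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: shows how java.util.regex treats the two ES patterns as-is.
class EsVsJavaRegex {
    // The ES patterns from the list above, written as Java string literals.
    static final String QUOTED = "\"some\\doslike\\path\""; // "some\doslike\path"
    static final String RANGE = "<001-100>";

    /**
     * True only if the pattern compiles under java.util.regex AND the whole
     * input matches it. A pattern Java rejects outright counts as "no match".
     */
    static boolean javaAccepts(String pattern, String input) {
        try {
            return Pattern.compile(pattern).matcher(input).matches();
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ES would match this path; Java does not, since it reads the quotes and backslashes differently.
        System.out.println(javaAccepts(QUOTED, "some\\doslike\\path"));
        // ES would match 002; Java sees only the literal characters < 0 0 1 - 1 0 0 >.
        System.out.println(javaAccepts(RANGE, "002"));
        System.out.println(javaAccepts(RANGE, "<001-100>"));
    }
}
```

The last call is the telling one: to Java, <001-100> is just a nine-character literal.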

Does anyone know how to convert these ES "extras" into a java.util.regex.Pattern that matches the same things? Or is there a Java jar I can use as an alternative to java.util.regex.Pattern that recognizes the entire set of ES regexp syntax?
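For just these two extras, a naive hand-rolled translation is possible. The following is only a sketch under stated assumptions, not a complete converter: the class EsRegexSketch and its methods are hypothetical, quoted literals are assumed to contain no escaped quotes, both range bounds are assumed to have the same number of digits, and the extras are assumed not to appear inside character classes. It rewrites "..." with Pattern.quote and expands <min-max> into an explicit alternation, which is only sensible for small ranges.

```java
import java.util.Locale;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: translates two ES-only constructs into java.util.regex syntax.
class EsRegexSketch {
    // Finds either a quoted literal ("...") or a numeric interval (<min-max>).
    private static final Pattern EXTRAS =
            Pattern.compile("\"([^\"]*)\"|<(\\d+)-(\\d+)>");

    /** Rewrites the two extras; everything else is copied through unchanged. */
    static String toJavaRegex(String esPattern) {
        Matcher m = EXTRAS.matcher(esPattern);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(esPattern, last, m.start());
            if (m.group(1) != null) {
                // Quoted literal: let Java escape it wholesale (\Q...\E).
                out.append(Pattern.quote(m.group(1)));
            } else {
                out.append(expandRange(m.group(2), m.group(3)));
            }
            last = m.end();
        }
        out.append(esPattern, last, esPattern.length());
        return out.toString();
    }

    /** Naively lists every zero-padded value in the range as an alternation. */
    static String expandRange(String min, String max) {
        if (min.length() != max.length()) {
            throw new IllegalArgumentException("sketch assumes equal-width bounds");
        }
        int lo = Integer.parseInt(min);
        int hi = Integer.parseInt(max);
        if (lo > hi) {
            throw new IllegalArgumentException("sketch assumes min <= max");
        }
        StringJoiner alternatives = new StringJoiner("|", "(?:", ")");
        for (int n = lo; n <= hi; n++) {
            alternatives.add(String.format(Locale.ROOT, "%0" + min.length() + "d", n));
        }
        return alternatives.toString();
    }

    /** Convenience: translate, compile, and test a full-string match. */
    static boolean matches(String esPattern, String input) {
        return Pattern.compile(toJavaRegex(esPattern)).matcher(input).matches();
    }
}
```

With this, EsRegexSketch.matches("<001-100>", "002") is true while "2" and "02" are rejected, and the quoted DOS-style path matches itself literally. Anything beyond these two constructs would need a real parser or an engine that understands the syntax natively.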

Thanks.

p.s. no need to tell me "don't use regexes" or "regexes are slow" - but thanks for reading and being concerned.

In case anyone else is interested: I've tracked down the regular expression engine used by the Lucene implementation. See: dk.brics.automaton

It should be straightforward to use these classes instead of java.util.regex.Pattern. From the FAQ at that site:

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

RegExp r = new RegExp("ab(c|d)*");
Automaton a = r.toAutomaton();
String s = "abcccdc";
System.out.println("Match: " + a.run(s)); // prints: Match: true

And here's what it looks like when using the lucene packages:

import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;

String actualPattern = "ab(c|d)*";
String target = "abcccdc";

// Compile the pattern to an automaton, then test the whole target string against it.
Automaton a = new RegExp(actualPattern).toAutomaton();
boolean isAMatch = new CharacterRunAutomaton(a).run(target);
