Regular expression queries in Elasticsearch

I know that the .* operator matches any sequence of characters of any length, including no characters, and that Elasticsearch has a RegexpQuery syntax.
So from that page I see that there are standard operators: . , ? , + , *, {…}, |, (…), [ … ].
Optional Operators: complement: a~bc, interval: foo<1-100>, intersection: aaa.+&.+bbb,
anystring: @&~(abc.+)
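For reference, a minimal regexp query using these operators might look something like the sketch below (the URL, index name and field name are placeholders, and I'm assuming a keyword field):

```python
# Sketch of a basic regexp query (placeholders: local URL, index "my-index",
# keyword field "message").
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster

query = {
    "query": {
        "regexp": {
            "message": {
                "value": "err.r[0-9]+",  # standard operators: . , [...] and +
                "flags": "ALL"           # enables the optional operators too
            }
        }
    }
}

resp = requests.post(f"{ES_URL}/my-index/_search", json=query)
print(json.dumps(resp.json(), indent=2))
```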

So my question is: how can I handle searching for ^ (beginning of line) or $ (end of line)?
And what can I use for the '\' character to treat the next character literally? Example: in "\$100", the \ indicates that the pattern is the literal "$100", not end-of-line ($) followed by "100".

Another question is how these are handled when searching in Elasticsearch:

  • A negative assert of the form "(! subexpression )". Matches any sequence of characters in the target sequence that does not match the pattern between the delimiters, and does not change the match position in the target sequence.
  • A hexadecimal escape sequence of the form "\xhh". Matches a character in the target sequence that is represented by the two hexadecimal digits hh.
  • A unicode escape sequence of the form "\uhhhh". Matches a character in the target sequence that is represented by the four hexadecimal digits hhhh.
  • A control escape sequence of the form "\ck". Matches the control character that is named by the character k.
  • A word boundary assert of the form "\b". Matches when the current position in the target sequence is immediately after a word boundary.
  • A negative word boundary assert of the form "\B". Matches when the current position in the target sequence is not immediately after a word boundary.
  • A dsw character escape of the form "\d", "\D", "\s", "\S", "\w", "\W". Provides a short name for a character class.

Thanks.

What is the high level problem you are trying to solve? Regexp queries can be quite slow and often do not scale well, so I would recommend only going down that path as a last resort.

Hi @Christian_Dahlqvist,

I need to query with the dtSearch query syntax on Elasticsearch.

Therefore, I think I need to support the dtSearch RegExp syntax, so for example I need to handle the following characters in a regexp query:
\b, \d, \s, \w, ^, $

So for example I need to manage these RegExp queries:
(\b\d+\b)
(\b[a-z]+[0-9]+\b)

Thanks, Attila

Probably worth checking a few assumptions first - are you searching text or keyword field types here?
With text fields the regular expressions are typically testing individual words in your content, so notions of "line start", "word start" or whitespace are redundant - the content you are testing has already been sliced up on the boundaries defined by your choice of analyzer.
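To make that concrete, here is a rough sketch (the "notes" index and the "body"/"body.raw" field names are hypothetical, with the same JSON field mapped as both text and keyword):

```python
# Sketch: the same kind of pattern behaves differently against a text field
# (tested term by term after analysis) and a keyword field (tested against
# the whole stored value, so no ^/$ anchors are needed or supported).
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster

# Against the analyzed text field, each token is tested individually,
# so "error[0-9]+" only has to match a single word like "error42".
per_term_query = {"query": {"regexp": {"body": {"value": "error[0-9]+"}}}}

# Against the keyword field, the pattern must match the entire value,
# so leading/trailing .* stand in for what ^...$ would bracket elsewhere.
whole_value_query = {"query": {"regexp": {"body.raw": {"value": ".*error[0-9]+.*"}}}}

for q in (per_term_query, whole_value_query):
    hits = requests.post(f"{ES_URL}/notes/_search", json=q).json()["hits"]
    print(hits["total"])
```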

@Mark_Harwood Regular expression queries should be managed in both text and keyword fields.
In my understanding, if I have a document with several properties, for example an email document:

  • EmailFrom, EmailTo and Subject are keyword properties, while
  • Body has a text field type

I took this example from this article.

I don't know if I am correct.
How would you determine which property should be the text or keyword type?

So in my case there would be queries against both keyword and text fields.
How would you manage regexp queries on both of these types?

A single JSON field can be mapped as both.
Generally speaking, the most likely choice would be from/to = keyword and subject/body = text, but it depends on your search and aggregation needs (a rough mapping sketch follows the list below).

  • If you want an aggregation to know top participants on a topic then I'd use a keyword field on an "email_addresses" field which contained copies of both from and to fields (see "copy to" feature in mappings)
  • If you want to search for email addresses by domain name only I'd either
    • index email addresses as text as well as keyword to tokenize OR
    • parse out domain names into a separate field in your client or using an "ingest pipeline"
  • If you want to find the most active email threads I'd consider having the subject field mapped as keyword but minus any "Re:" prefix added by replies.
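Putting that together for the email example above, a mapping sketch could look like this (the index name, the "raw" sub-field and the "email_addresses" roll-up field are illustrative choices, not anything prescribed):

```python
# Rough mapping sketch: from/to as keyword copied into one roll-up field,
# subject as text with a keyword sub-field, body as text.
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster

mapping = {
    "mappings": {
        "properties": {
            "EmailFrom": {"type": "keyword", "copy_to": "email_addresses"},
            "EmailTo":   {"type": "keyword", "copy_to": "email_addresses"},
            "email_addresses": {"type": "keyword"},  # for a "top participants" terms agg
            "Subject": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}}  # same JSON field mapped as both
            },
            "Body": {"type": "text"}
        }
    }
}

print(requests.put(f"{ES_URL}/emails", json=mapping).json())
```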

I generally would only be looking to run any regex queries as a last resort. They're hard for users to write and read.
In Elasticsearch, matching words near each other in the body of text is handled using phrase queries. Text analyzers typically perform cleansing like case normalisation, stemming and punctuation removal to make searches simpler.
Structured queries like matching domain names can be made simpler if the regexes to extract domain names are applied at index time rather than query time.
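For example, a phrase query with a little slop over the email body might look like this (index and field names reuse the sketch above):

```python
# Sketch: words near each other via match_phrase rather than a regex.
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster

phrase_query = {
    "query": {
        "match_phrase": {
            "Body": {
                "query": "purchase order",
                "slop": 2  # allow up to two positions between the words
            }
        }
    }
}

print(requests.post(f"{ES_URL}/emails/_search", json=phrase_query).json())
```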

It does depend on your specific content and queries though.


Thanks @Mark_Harwood,

I need to do some further investigation, as you mentioned, into using analyzers, phrase queries and text fields almost everywhere.

@Mark_Harwood or others, could you please help me with these regexp characters, thanks:

  • \d : Whole Number 0 - 9,
  • \w : Alphanumeric Character,
  • \W : Symbols

That’s being worked on but won’t be available for some time.
For now you could write code in your client that expands those shorthand expressions in search strings to their full form.
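A very rough sketch of that expansion is below. The expansion table itself is an assumption (ASCII-only, see the note further down about "word characters"), and \b is simply dropped on the basis that analyzed text fields are already split on word boundaries:

```python
# Sketch: expand regex shorthands into explicit character classes before
# sending the pattern as a regexp query. The table is an assumption
# (ASCII-only) and \b is dropped, relying on the analyzer's tokenization.
EXPANSIONS = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z_0-9]",
    r"\W": "[^a-zA-Z_0-9]",
    r"\s": "[ \t]",     # space and tab; extend if your data needs more
    r"\S": "[^ \t]",
    r"\b": "",          # no equivalent; rely on the analyzer's word boundaries
}

def expand_shorthands(pattern: str) -> str:
    """Naively replace unsupported shorthands with explicit character classes.

    NOTE: plain string replacement; a fuller version should parse the pattern
    so that escaped backslashes and character classes are left alone.
    """
    for shorthand, replacement in EXPANSIONS.items():
        pattern = pattern.replace(shorthand, replacement)
    return pattern

# The dtSearch-style examples from earlier in the thread:
print(expand_shorthands(r"\b\d+\b"))           # -> [0-9]+
print(expand_shorthands(r"\b[a-z]+[0-9]+\b"))  # -> [a-z]+[0-9]+
```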


Thanks @Mark_Harwood,

this is again a really useful answer, because from it we can see the equivalents for the regexp metacharacters here.

You may actually want to beef up the expansion list - as my colleague Alan said:

"Defining a 'word character' as [a-zA-Z_0-9] doesn't feel very 2020..."
