Understanding regexp query better to avoid query failures and OOMs

Hi Guys,

I have been trying to get my head around how the regexp query works in
Elasticsearch. To my knowledge, it uses Lucene's regex engine, which is
limited. A regexp query on a particular field can be expensive, depending
on the number of unique terms in the index for that field. So if a field
has the value "brown sugar cake" and the standard (default) tokenizer is in
use, then any regular expression given in a regexp query on that field is
run against the terms brown, sugar and cake, not against the entire string.
For this reason, regex in Elasticsearch (and Lucene) becomes expensive. Am
I correct?
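
To make sure I am describing this correctly, here is a minimal Python sketch
of what I think happens. It is only an illustration of my understanding, not
Elasticsearch or Lucene code: tokenize() is a rough stand-in for the standard
analyzer, regexp_matches() is a name I made up, and Python's re syntax is not
the same as Lucene's. The one thing I am fairly sure of is that the pattern
has to match a whole term, which is why I use fullmatch.

```python
import re

def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def regexp_matches(pattern, terms):
    """Return the terms that the pattern matches in full (anchored at both ends)."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

value = "brown sugar cake"
analyzed_terms = tokenize(value)   # three separate terms in the index
not_analyzed_terms = [value]       # the whole string is a single term

# On the analyzed field the pattern is tried against each term on its own.
print(regexp_matches("su.*", analyzed_terms))
# A pattern that spans several words can only match the not_analyzed term.
print(regexp_matches("brown sugar.*", analyzed_terms))
print(regexp_matches("brown sugar.*", not_analyzed_terms))
```

If this picture is right, the cost should grow with the number of distinct
terms the pattern has to be tried against.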

Assuming that I am, I have a further question. If the performance of a
regexp query really depends on the number of unique terms in a field, then
reducing the number of unique terms should significantly improve
performance. So running regexp queries on not_analyzed fields should help.
But that's not really the case, and regexp is still extremely slow. In my
case, the field is called URL and it holds a URL with its query parameters.
The field is not_analyzed. In most cases a simple regex is fast enough, but
if the regex gets slightly complicated, I never get a response from the
server. I also noticed on a local ES server that memory usage keeps
increasing and eventually I get an OOM exception.

Another thing that is beyond my understanding is which variables the
performance of a regexp query depends on. Just to test that, I created a
new index with just 1 document. The document looks something like this:

{
  "url": "https://abc.com/launchingsoon?product=imgburn&",
  "ts": 123456679,
  "os": "Linux",
  ...
}

Remember there is just 1 document in the index. I ran the following regex
query:

GET /INDEX/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "regexp": {
                "url": ".*(cacaoweb|youtube-to-mp3-converter|google-chrome|itunes|adwcleaner|msn-messenger-skype|skype|adobe-flash-player-ie|firefox|jpeg-to-pdf|avira-antivir-personal---free-antivirus|irfanview|mp3-converter|realplayer|adobe-reader|youtube-download--convert|internet-explorer-8|windows-live-mail|windows-live-movie-maker-2011|ccleaner|zune-software|vanbascos-karaoke-player|amule|karaoke|imgburn|google-earth|internet-explorer-9|mp3jam|media-downloader|avg-anti-virus-free-edition|k-lite-codec-pack-full|vwo|windows-media-player|opera|kmplayer|sopcast|drweb-cureit|vwo).*"
              }
            }
          ]
        }
      }
    }
  }
}

This query ran, but it took about 400 ms on my local machine. Then I ran
the following query, which is meant to match the same things but is written
as a very unoptimized regular expression:

GET /INDEX/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "regexp": {
                "url.not_analyzed": ".*cacaoweb.*|.*youtube-to-mp3-converter.*|.*google-chrome.*|.*itunes.*|.*adwcleaner.*|.*msn-messenger-skype.*|.*skype.*|.*adobe-flash-player-ie.*|.*firefox.*|.*jpeg-to-pdf.*|.*avira-antivir-personal---free-antivirus.*|.*irfanview.*|.*mp3-converter.*|.*realplayer.*|.*adobe-reader.*|.*youtube-download--convert.*|.*internet-explorer-8.*|.*windows-live-mail.*|.*windows-live-movie-maker-2011.*|.*ccleaner.*|.*zune-software.*|.*vanbascos-karaoke-player.*|.*amule.*|.*karaoke.*|.*imgburn.*|.*google-earth.*|.*internet-explorer-9.*|.*mp3jam.*|.*media-downloader.*|.*avg-anti-virus-free-edition.*|.*k-lite-codec-pack-full.*|.*photoscape.*|.*windows-media-player.*|.*opera.*|.*kmplayer.*|.*sopcast.*|.*drweb-cureit.*"
              }
            }
          ]
        }
      }
    }
  }
}

This query took a lot of time. The logs showed the GC kicking in every 3-5
seconds, and finally the query failed with an OOM exception. I have been
trying to understand why this query causes an OOM. After the OOM, the ES
node becomes unresponsive until the GC is actually able to free up some
memory. This is the exact exception I get in the logs:
http://pastebin.mozilla.org/6975835.

In the above case, I understand that the regex is not optimized for
Elasticsearch's (or rather Lucene's) regex engine. But why would an
unoptimized regex require so much memory? I don't quite understand that.
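
To make the question concrete, here is a small Python experiment around one
possible explanation. It rests on assumptions I have not verified against
Lucene's source: that the pattern is compiled to a DFA by plain subset
construction without minimizing along the way, and that the two patterns
have the shapes .*w1.*|.*w2.*|... and .*(w1|w2|...).* respectively. All the
function names are mine, and the one-letter "words" are only there to keep
the numbers small.

```python
from collections import deque

ANY = None  # wildcard label: matches every symbol

def count_dfa_states(transitions, start, alphabet):
    """Count the state sets reached by naive subset construction (no minimization)."""
    def step(subset, symbol):
        return frozenset(
            target
            for state in subset
            for label, target in transitions.get(state, ())
            if label is ANY or label == symbol
        )
    start_set = frozenset(start)
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        for symbol in alphabet:
            nxt = step(subset, symbol)
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

def separate_form(words):
    """NFA for .*w1.*|.*w2.*|... : every alternative has its own leading and trailing loop."""
    transitions, start = {}, []
    for i, word in enumerate(words):
        start.append((i, 0))
        transitions.setdefault((i, 0), []).append((ANY, (i, 0)))
        for k, ch in enumerate(word):
            transitions.setdefault((i, k), []).append((ch, (i, k + 1)))
        transitions.setdefault((i, len(word)), []).append((ANY, (i, len(word))))
    return transitions, start

def grouped_form(words):
    """NFA for .*(w1|w2|...).* : one shared leading loop and one shared trailing loop."""
    transitions = {"S": [(ANY, "S")], "ACC": [(ANY, "ACC")]}
    for i, word in enumerate(words):
        prev = "S"
        for k, ch in enumerate(word):
            nxt = "ACC" if k == len(word) - 1 else (i, k + 1)
            transitions.setdefault(prev, []).append((ch, nxt))
            prev = nxt
    return transitions, ["S"]

def alphabet_for(words):
    """Characters used by the words, plus "\0" standing for any other character."""
    return sorted({ch for w in words for ch in w}) + ["\0"]

words = list("abcdefghij")  # ten one-letter "words"
print(count_dfa_states(*separate_form(words), alphabet_for(words)))
print(count_dfa_states(*grouped_form(words), alphabet_for(words)))
```

With these toy words the separate form needs 1024 state sets and doubles
with every word I add, because each alternative keeps its own "already
matched" state, while the grouped form stays at 2. If Lucene does something
similar, 37 alternatives would explain the memory use, but I would like
someone who knows the internals to confirm or correct this.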

I don't know what's causing this and I really need to understand how Regexp
Queries work in Elasticsearch and how they work in Lucene.

Vaidik Kapoor
vaidikkapoor.info
