Speeding up Elasticsearch regex filters/query optimization


(Abhijith Reddy) #1

We are currently using regexp filters to support "contains" queries for an application. As expected, performance is abysmal because the regex starts with a wildcard rather than a literal prefix. Below is an example of the queries we are running:

[root@machine ~]# time curl -XGET 'localhost:9200/items_search/_count?routing=123&pretty' -d '{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
                  "must" : [
                    { "term"  : { "cat_id" : "123"}},
                    { "term"  : { "availability"       : "in stock"}},
                    { "regexp": { "category.lowercase" : ".*women .*"}}
                  ]
           }
         }
      }
   }
}'
{
  "count" : 11323,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  }
}

real    0m41.518s
user    0m0.005s
sys     0m0.001s

If I change the regex to "women .*", the query returns within a couple of seconds. One thing to note is that the shard this query is routed to is around 80 GB.
I understand that the right way to do this would be to use ngram analyzers on the fields we want "contains" matching on, which would speed up search queries. However, we currently have over 3.5 billion documents in our index and very few queries use regex, so changing the analyzer for the field (currently the field is not analyzed) would hurt our indexing rate.
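To make the ngram trade-off concrete, here is a minimal Python sketch of the idea. It is not how Elasticsearch implements its ngram analysis, and the sample category values and function names are made up for illustration. Each value is split into trigrams at index time, so a "contains" lookup becomes a few set intersections instead of a scan, at the cost of storing many more entries per value.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(values):
    """Map each trigram to the ids of the values that contain it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        for gram in trigrams(value):
            index[gram].add(doc_id)
    return index


def contains(index, values, needle):
    """Return ids of values containing needle (needle needs >= 3 characters)."""
    grams = trigrams(needle)
    if not grams:
        raise ValueError("needle must be at least 3 characters long")
    # Candidates are the values that contain every trigram of the needle.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify each candidate: sharing all trigrams does not guarantee
    # that they appear contiguously as the needle.
    return sorted(i for i in candidates if needle in values[i])


# Hypothetical sample data standing in for category.lowercase values.
categories = [
    "clothing > women > dresses",
    "clothing > men > shirts",
    "shoes > women > boots",
    "toys > kids",
]
index = build_index(categories)
matches = contains(index, categories, "women ")
```

Here `matches` holds the ids of the two values that contain "women ". Note that `index` has many more keys than there are values, which is the extra indexing work and storage mentioned above.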
Are there any workarounds for this? Any pointers or resources would be much appreciated.

Thanks


(Mark Walkom) #2

Regexp is slow, but you're basically saying you want to check all category.lowercase fields for anything that contains the word "women", which means the entire field has to be scanned for every document.
The "women .*" search is a little better because only the start of the field has to be checked.
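The difference between the two patterns can be sketched in Python with a simplified model: a sorted list of whole-field terms, since the field is not analyzed. This is an illustration only, not Lucene's actual automaton-based implementation, and the sample terms are invented. With a literal prefix, a binary search jumps straight to the matching range. With a leading `.*`, every term has to be tested.

```python
import bisect
import re


def prefix_scan(sorted_terms, prefix):
    """Binary-search to the first candidate, then walk while the prefix matches."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches, examined


def regex_scan(sorted_terms, pattern):
    """Test every term against the pattern (anchored at both ends)."""
    compiled = re.compile(pattern)
    matches = [t for t in sorted_terms if compiled.fullmatch(t)]
    return matches, len(sorted_terms)


# Hypothetical sample terms standing in for category.lowercase values.
terms = sorted([
    "accessories > women > bags",
    "clothing > men > shirts",
    "clothing > women > dresses",
    "toys > kids",
    "women > shoes",
    "women > watches",
])

prefix_hits, prefix_examined = prefix_scan(terms, "women ")
regex_hits, regex_examined = regex_scan(terms, ".*women .*")
```

In this toy data the prefix lookup examines only the terms in the matching range, while the leading-wildcard pattern examines all of them. On a large shard that gap grows with the number of distinct values in the field.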

You are going to be better off creating a dedicated field that records the value you are after.
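One way to read this suggestion, sketched in Python under stated assumptions: compute a flag when the document is indexed, then filter on that flag with a cheap term filter. The field name `category_has_women` and the helper `add_category_flag` are hypothetical and not part of the original mapping. The approach only helps for keywords known in advance, and existing documents would need to be updated to carry the new field.

```python
def add_category_flag(doc, keyword="women "):
    """Return a copy of doc with a boolean flag (hypothetical field name)
    recording whether the lowercased category contains the keyword."""
    enriched = dict(doc)
    enriched["category_has_women"] = keyword in doc.get("category", "").lower()
    return enriched


# Hypothetical document, enriched before it is sent to the index.
doc = {
    "cat_id": "123",
    "availability": "in stock",
    "category": "Clothing > Women > Dresses",
}
indexed_doc = add_category_flag(doc)

# The count request body from the question, with the regexp clause
# swapped for a term filter on the assumed flag field.
query = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"cat_id": "123"}},
                        {"term": {"availability": "in stock"}},
                        {"term": {"category_has_women": True}},
                    ]
                }
            }
        }
    }
}
```

The string matching happens once per document at index time, so the query itself no longer has to scan any field values.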

