How to optimize a regexp filter

The following is one of our slow queries, taken from our slow query log. I tried running it locally: it takes 38 seconds and returns 6 records, but if I remove the "regexp" filter, it only takes about 8 milliseconds to finish and returns 7 records. My question is: what is the best practice for optimizing this kind of query with a regex filter? Is there a way to put a weight on the filter so it is applied after all the other filters? Or should I fetch all the data and do the regex matching locally?

Notes: The values in tagv are relatively short (4-20 chars), but overall there are 2 billion documents in this index.

{
  "bool": {
    "filter": [
      { ... },
      { ... },
      { ... },
      {
        "regexp": {
          "tagv.raw": {
            "value": ".*-.*",
            "flags_value": 65535,
            "max_determinized_states": 10000,
            "boost": 1.0
          }
        }
      },
      { ... }
    ],
    "adjust_pure_negative": true,
    "boost": 1.0
  }
}
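The last option mentioned in the question (running only the cheap filters in Elasticsearch and applying the regex client-side) can be sketched as follows. This is just an illustration with stand-in data, not code from the thread; the hit shape mirrors a standard Elasticsearch search response, and `re.fullmatch` is used because Lucene regexp queries are anchored to the whole term.

```python
import re

def filter_hits_locally(hits, pattern, field="tagv"):
    """Keep only hits whose `field` value fully matches `pattern`,
    mimicking the anchored semantics of a Lucene regexp query."""
    regex = re.compile(pattern)
    return [h for h in hits if regex.fullmatch(h["_source"][field])]

# Stand-in documents for illustration (not real data from the index):
hits = [
    {"_source": {"tagv": "env-prod"}},
    {"_source": {"tagv": "staging"}},
    {"_source": {"tagv": "us-east"}},
]

# Same pattern as the slow query: any value containing a dash.
matching = filter_hits_locally(hits, r".*-.*")
```

Since the query without the regexp returns only 7 records here, filtering them client-side is cheap; the trade-off is fetching the documents first and doing an extra pass in the application.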

You have created a very expensive regular expression that basically has to be checked against every value of that field across the matching documents in the index.

If your goal is to check for a dash (-), how about doing this at index time and setting a field like contains_dash: true|false? If you query that, you will be back to your fast search.
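A sketch of that idea, with made-up field names for illustration: compute the flag when indexing each document, map it as a boolean, and replace the regexp with a cheap term filter.

```json
PUT myindex
{
  "mappings": {
    "properties": {
      "tagv": { "type": "keyword" },
      "contains_dash": { "type": "boolean" }
    }
  }
}

POST myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "contains_dash": true } }
      ]
    }
  }
}
```

A term filter on a boolean field is a simple postings-list lookup, so it avoids scanning the field's values entirely.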


Thanks for the suggestion @spinscale,
We are using Elasticsearch for our tagging functionality. tagv holds the value of the tag, and there is another field called tagk which stores the key of the tag.

This regex comes from one of our users; it seems they are trying to find a tag which contains "-" in the value. In other words, we cannot predict at indexing time what or how users will want to search. But is it really that expensive (38 seconds) to apply this regex to 7 strings which are all 4-20 chars long?

I tried a similar query (with ".*" as the regex) which returns 47 similar records, and it also takes 38 seconds, so I suspect this regex filter is not applied to just the result set. Is there a way to make it run last, after the other filters are applied? I tried using "post_filter" but didn't see any difference.

It's likely a function of the number of unique tagv.raw values in the index - it needs to run a regex on each of them in a linear fashion.
Can you use the cardinality aggregation on this field to let us know how many unique values exist?

Thanks @Mark_Harwood

Running the following gave me a "504 Gateway Time-out" — any suggestions?

I think you are right: there seem to be tons of different tagv.raw values in the index. Is there any way to avoid running the regex on all of them?

curl -H 'Content-Type: application/json' localhost:9200/myindex/_search?size=0 -d'
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "tagv.raw"
      }
    }
  }
}'

@Mark_Harwood @spinscale

Any other suggestions? 🙂

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.