How to optimize regexp filter

junhuangli · October 21, 2020, 9:40pm

The following is one of our slow queries, I got from our slow query log. I tried run it locally which takes 38 seconds and return 6 records, but if I removed the "regexp" filter, it only takes like 8 miliseconds to finish the query and return 7 records. My question is what is the best practice to optimize this kind of query with regex filter? Is there a way to put on a weight on the filter so it will be applied after all other filters? Or should I get all the data and do a local regex matching?

Notes: The value in tagv is relatively short like (4 - 20 chars). But over all there are 2 billions of documents in this index

{
"bool":{
"filter":[
{...},
{...},
{...},
{
"regexp":{
"tagv.raw":{
"value":".*-.*",
"flags_value":65535,
"max_determinized_states":10000,
"boost":1.0
}
}
},
{...}
],
"adjust_pure_negative":true,
"boost":1.0
}
}

spinscale · October 22, 2020, 9:53am

You have created a very expensive regular expression, that basically has to check every value of a field for all matching documents in an index.

If your goal is to check for a dash (-), how about doing this on index time and then set a field like contains_dash: true|false. If you query that you will be back to your fast search.

junhuangli · October 22, 2020, 4:28pm

Thanks for the suggestion @spinscale,
We are using elasticsearch for our tagging functionality. The tagv is the value of the tag and there is another field called tagk which stores the key of the tag.

This regex is from our user, seems like they are trying to find a tag which contains the "-" in the value. In other words, we can not predict what/how the users will like to search during the indexing time. But is it that expensive to apply this regex to 7 string which are all 4-20 chars long(38 seconds)?

I tried a similar query (with".*" as regex) which returns 47 similar records, and it also takes 38 seconds, so I suspect this regex filter is not just applied to the result set. Is there a way to make it run at the end after other filters are applied? I tried using "post_filter" but doesn't see any difference

Mark_Harwood · October 22, 2020, 4:56pm

It's likely a function of the number of unique tagv.raw values in the index - it needs to run a regex on each of them in a linear fashion.
Can you use the cardinality aggregation on this field to let us know how many unique values exist?

junhuangli · October 22, 2020, 5:50pm

Thanks @Mark_Harwood

Running the following gave me "504 Gateway Time-out", any suggestions?

I think you are right there seems to be tons of different tagv.raw values in the index now. Is there any way to avoid running regx on all of them?

curl localhost:9200/myindex/_search?size=0  -d'
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "tagv.raw"
      }
    }
  }
}'

junhuangli · October 23, 2020, 9:09pm

@Mark_Harwood @spinscale

Any other suggestions?

system · November 20, 2020, 9:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Speeding up elastic search regex filters/query optimization Elasticsearch	2	2517	July 5, 2017
Minimise dataset for regexp query Elasticsearch	3	707	July 5, 2017
Improving search query time Elasticsearch	6	413	January 5, 2021
Regex on results only Elasticsearch	5	1052	July 5, 2017
Understanding regexp query better to avoid query failures and OOMs Elasticsearch	1	887	July 6, 2017

How to optimize regexp filter

Related topics