The following is one of our slow queries, I got from our slow query log. I tried run it locally which takes 38 seconds and return 6 records, but if I removed the "regexp" filter, it only takes like 8 miliseconds to finish the query and return 7 records. My question is what is the best practice to optimize this kind of query with regex filter? Is there a way to put on a weight on the filter so it will be applied after all other filters? Or should I get all the data and do a local regex matching?
Notes: The value in tagv is relatively short like (4 - 20 chars). But over all there are 2 billions of documents in this index
You have created a very expensive regular expression, that basically has to check every value of a field for all matching documents in an index.
If your goal is to check for a dash (-), how about doing this on index time and then set a field like contains_dash: true|false. If you query that you will be back to your fast search.
Thanks for the suggestion @spinscale,
We are using elasticsearch for our tagging functionality. The tagv is the value of the tag and there is another field called tagk which stores the key of the tag.
This regex is from our user, seems like they are trying to find a tag which contains the "-" in the value. In other words, we can not predict what/how the users will like to search during the indexing time. But is it that expensive to apply this regex to 7 string which are all 4-20 chars long(38 seconds)?
I tried a similar query (with".*" as regex) which returns 47 similar records, and it also takes 38 seconds, so I suspect this regex filter is not just applied to the result set. Is there a way to make it run at the end after other filters are applied? I tried using "post_filter" but doesn't see any difference
It's likely a function of the number of unique tagv.raw values in the index - it needs to run a regex on each of them in a linear fashion.
Can you use the cardinality aggregation on this field to let us know how many unique values exist?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.