Regex on results only


#1

Why is it that

{
  "query": {
    "ids": {
      "type": "mytype",
      "values": [
        "AU8wgJipa1wkLOvkEVEL"
      ]
    }
  }
}

is substantially faster (instant vs. a minute) than

{
  "query": {
    "ids": {
      "type": "mytype",
      "values": [
        "AU8wgJipa1wkLOvkEVEL"
      ]
    }
  },
  "post_filter": {
    "regexp": {
      "field1": ".*somethingsomething.*"
    }
  }
}

I would have assumed that the slow regex would only run on the result of the query (a single document in this case). If I were grepping a text file, the regex on this one doc would be instant too, but here it is adding nearly a minute to the query. I could apply the regex on the application side, but that doesn't feel right.
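For reference, the application-side fallback would look roughly like the Python sketch below. It assumes the hits arrive in the usual `hits.hits` list with a `_source` object; `filter_hits`, the sample data and the second id are made up for illustration.

```python
import re

def filter_hits(hits, field, pattern):
    """Keep only hits whose _source[field] contains a match for pattern."""
    rx = re.compile(pattern)
    kept = []
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        # Only string values can be matched; anything else is skipped.
        if isinstance(value, str) and rx.search(value):
            kept.append(hit)
    return kept

# Hypothetical stand-in for response["hits"]["hits"] returned by the fast query.
sample_hits = [
    {"_id": "AU8wgJipa1wkLOvkEVEL", "_source": {"field1": "abc somethingsomething xyz"}},
    {"_id": "hypothetical-id-2", "_source": {"field1": "unrelated value"}},
]

matching = filter_hits(sample_hits, "field1", r"somethingsomething")
print([h["_id"] for h in matching])
```

Because `rx.search` looks for the pattern anywhere in the value, the surrounding `.*` from the original expression is not needed here.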

field1 is not analysed. I need to be able to run arbitrary regexes against it, which is why I want the regex to run only after a query has narrowed down the results. The query on _id is only for demonstration purposes; usually I'm using a match there.

Thanks!


(Joshua Rich) #2

A better way to get multiple documents by id would be to use the mget API. There is really no need to search if you know the ids of the documents you want.

A post filter happens after the final result set is fetched from all participating shards. I'm not sure why it would be so slow, but it will generally be slower than, say, normal filtering, which you should be using instead. Note also that leading wildcards are extremely inefficient, as you are essentially forcing the search to go over every term for that field.
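As a rough illustration of the leading-wildcard cost, here is a toy Python model. It is not Elasticsearch code: it assumes the terms of a field are kept as one sorted list, and the function names and sample terms are invented for the example.

```python
import bisect
import re

def scan_all_terms(terms, pattern):
    """Leading-wildcard case: nothing can be skipped, so every term is tested."""
    rx = re.compile(pattern)
    visited = 0
    matches = []
    for term in terms:
        visited += 1
        if rx.fullmatch(term):  # the whole term must match the expression
            matches.append(term)
    return matches, visited

def scan_prefix(terms, prefix):
    """Known-prefix case: jump to the first candidate, stop once the prefix ends."""
    start = bisect.bisect_left(terms, prefix)
    visited = 0
    matches = []
    for term in terms[start:]:
        visited += 1
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches, visited

# Hypothetical term list: 10,000 filler terms plus one term of interest.
terms = sorted([f"doc{i:05d}-payload" for i in range(10000)] + ["zz-somethingsomething-zz"])

wild_matches, wild_visited = scan_all_terms(terms, r".*somethingsomething.*")
prefix_matches, prefix_visited = scan_prefix(terms, "zz-some")
print(wild_visited, prefix_visited)
```

In this model both scans find the same single term, but the wildcard scan has to visit the whole list while the prefix scan visits only the terms under that prefix.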

Again, if you know what ids you want, use mget. If you don't, try to use a regular filter with a search, and avoid leading wildcards.
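A minimal sketch of what the mget request could look like, built in Python so it can be handed to whatever HTTP client is in use. The index name `myindex` and the helper `build_mget_request` are assumptions for the example; the type and id are taken from the question.

```python
import json

def build_mget_request(index, doc_type, ids):
    """Return (path, body) for a multi-get of the given ids."""
    path = f"/{index}/{doc_type}/_mget"
    body = {"ids": list(ids)}
    return path, body

# "myindex" is a hypothetical index name; "mytype" and the id come from the question.
path, body = build_mget_request("myindex", "mytype", ["AU8wgJipa1wkLOvkEVEL"])
print("GET", path)
print(json.dumps(body, indent=2))
```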


#3

field1 is not analyzed, and it is my understanding that there are no terms with a not-analyzed field. But let's say it was analyzed: from the sounds of it, the regex would be searching all the terms, from all the docs, for that field, and if that is happening, that certainly explains the behaviour.

I had assumed it would only do a regex on the result of the preceding query, which seems to be a wrong assumption. If the preceding query doesn't help narrow the results for the regexp, then Elasticsearch simply is not the right tool for me, and that's okay.


(Joshua Rich) #4

Why search for documents you don't need, then filter? Why not search and filter at the same time, which is what a filtered query does? Additionally, query filters can be cached; post filters cannot. I don't see any reason in your original code why you can't move that post_filter to an actual filter as part of the query. You almost never want to use a post_filter.
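To make that concrete, here is a hedged Python sketch that builds the request body with the regexp moved from post_filter into a filtered query. It assumes a release where the `filtered` query is available; `build_filtered_search` is a made-up helper name, and the ids query and regexp are the ones from the original post.

```python
import json

def build_filtered_search(query, regexp_field, pattern):
    """Wrap an existing query in a filtered query that carries a regexp filter."""
    return {
        "query": {
            "filtered": {
                "query": query,
                "filter": {"regexp": {regexp_field: pattern}},
            }
        }
    }

# The ids query from the first post, reused unchanged.
ids_query = {"ids": {"type": "mytype", "values": ["AU8wgJipa1wkLOvkEVEL"]}}

search_body = build_filtered_search(ids_query, "field1", ".*somethingsomething.*")
print(json.dumps(search_body, indent=2))
```

The resulting body has no top-level post_filter; the query and the regexp filter are both part of the single filtered query.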


#5

Thank you for all your help so far. As I had mentioned, the _id query is only for demonstration purposes; usually this is a match_phrase. Preferably this would simply be a regex, but I know my regex is very slow. I know that I can use a match_phrase to drastically cut down on the number of potential matches, and I then use the regex to further refine the results to what I actually need. But it seems that whether I use the match_phrase or not, the regex takes the same time to complete.

I'm not intentionally searching for documents I don't need; what I am doing is reducing the number of documents the regex has to be run against.

The problem seems to be that the regex is searching far more than the results of my match_phrase or, in this example, the _id query. In this example the search on _id is instant, but when I throw the regex in, it takes substantially longer, far longer than it should take to run a regex on a single document.

