Query run optimization for percolation

anishek · December 24, 2015, 11:10am

We are using percolation queries to tag user docs. We currently have about 80 queries. The most time-consuming part of these queries is the geo distance filter, which we apply to up to 10,000 points within each query.

Our doc mapping is something like:

{
  "user_profiles": {
    "dynamic": "strict",
    "properties": {
      "code": {
        "type": "string",
        "index": "not_analyzed"
      },
      "freq_1": {
        "type": "geo_point"
      }
    }
  }
}

"code" is a four-letter, upper-case alphabetic value. Only 13 variations of it appear in the percolation queries; user docs can have any variation.

Our sample percolation query looks like:

{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "code": "ABCD"
        }
      },
      "filter": {
        "and": [{
          "or": [{
            "geo_distance": {
              "distance": "1m",
              "freq_1": {
                "lat": 2,
                "lon": 3
              }
            }
          }]
        }]
      },
      "strategy": "query_first"
    }
  }
}

We run a lot of docs through the percolation queries whose code will not match any of the queries.

Is it better to use percolation query metadata, putting the country code as an attribute on each percolation query and then filtering, for each doc, which queries run against it? Or should we just use the method above, where all queries run against all docs but should process very quickly because we are doing query_first on code?

jpountz · December 24, 2015, 1:04pm

Putting the country code as an attribute of queries and using it at percolation time will certainly help performance, especially as the number of queries that you percolate grows.
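
A minimal sketch of the metadata approach on 1.x, assuming a hypothetical index named `users` and a hypothetical metadata field `query_code` that is mapped as `not_analyzed` on the `.percolator` type (neither name comes from the original setup). The query would be registered under `/users/.percolator/<id>`, with the metadata field sitting next to `query`:

```json
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "code": "ABCD"
        }
      },
      "filter": {
        "geo_distance": {
          "distance": "1m",
          "freq_1": {
            "lat": 2,
            "lon": 3
          }
        }
      },
      "strategy": "query_first"
    }
  },
  "query_code": "ABCD"
}
```

The percolate request, sent to `/users/user_profiles/_percolate`, would then carry a `filter` on that metadata field alongside the doc:

```json
{
  "doc": {
    "code": "ABCD",
    "freq_1": {
      "lat": 2,
      "lon": 3
    }
  },
  "filter": {
    "term": {
      "query_code": "ABCD"
    }
  }
}
```

Only the registered queries whose `query_code` matches the filter are executed, so a doc whose code is used by no query should skip the geo work entirely.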

Are you stuck on Elasticsearch 1.x? If you are on version 2.x, Elasticsearch should automatically figure out that it needs to do the costly bits last (the geo computation in this case), so there would be no need to be specific about how to run the query.
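
For illustration, the sample query could be written on 2.x roughly as follows, using `bool` with a `filter` clause in place of `filtered` and dropping `strategy`. This is a sketch that assumes the same mapping, not a tested migration:

```json
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "code": "ABCD"
        }
      },
      "filter": {
        "geo_distance": {
          "distance": "1m",
          "freq_1": {
            "lat": 2,
            "lon": 3
          }
        }
      }
    }
  }
}
```

With several points per query, the individual `geo_distance` clauses could go into the `should` list of a nested `bool` inside `filter`, which replaces the 1.x `or` filter.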

anishek · December 28, 2015, 6:23am

@jpountz we are using ES 1.7.4. If we specify the strategy, the costly bit should be done last, so we should still be OK with the way we have the queries now, right?

The number of queries for us might not go over 1000. Does that still pose a problem?

jpountz · December 28, 2015, 9:50am

Yes, specifying the strategy would work on 1.7. Having few queries is not a problem; I just wanted to highlight that the approach you suggested would work even better if you have many queries.