Query String vs Bool Performance?

gposcidonio · July 15, 2019, 9:48pm

Background

So for some context before my question, I have an enum, let's call it my_enum. It has 5 possible values, all single characters: Z, Y, X, W, and V. My documents can have 0 or more of these values in their my_enum field.

My goal is to write a search query where a user can specify any subset of those enum values (e.g. [Z, X, W]) and I will return every document whose my_enum values are all contained in the specified set. So for a query asking for [Z, X, W] I would return any documents that have the following my_enum values:

[Z, X, W]
[Z, X]
[X, W]
[Z, W]
[Z]
[X]
[W]
[]
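
To pin down the rule, here is a minimal Python sketch of the matching logic outside Elasticsearch. The helper name `is_allowed` and the sample documents are made up for illustration; only the subset rule itself comes from the description above.

```python
def is_allowed(doc_values, requested):
    """Return True when every value on the document is in the requested set."""
    return set(doc_values) <= set(requested)


requested = ["Z", "X", "W"]

# Hypothetical documents, keyed by an invented id.
docs = {
    "a": ["Z", "X"],   # subset of the request -> should match
    "b": [],           # empty field -> should match
    "c": ["Z", "Y"],   # contains Y, which was not requested -> no match
    "d": ["V"],        # contains V, which was not requested -> no match
}

matching = sorted(doc_id for doc_id, values in docs.items()
                  if is_allowed(values, requested))
print(matching)  # ['a', 'b']
```

A document is rejected as soon as it carries any value outside the requested set, which is why both queries below exclude Y and V explicitly.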

Solutions

I believe I've written two queries that return equivalent results. Here is an example for a search for the set [Z, X, W]:

Using `query_string`

POST /oracle_cards/_search
{
  "size": 100,
  "query": {
    "query_string": {
      "default_field": "my_enum",
      "query": "(Z -Y X W -V) OR (-Z -Y -X -W -V)"
    }
  },
  "sort": [
    {
      "_id": {
        "order": "desc"
      }
    }
  ]
}
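
Since the requested set changes per search, the query string has to be generated. The following is a rough Python sketch of how that could be done; `ALL_VALUES`, `build_query_string` and `build_search_body` are hypothetical names, and the output format simply mirrors the hand-written string above.

```python
# The five enum values from the background section, in the order used above.
ALL_VALUES = ["Z", "Y", "X", "W", "V"]


def build_query_string(requested, all_values=ALL_VALUES):
    """Build the query_string text: requested values stay plain,
    every other value is negated with a leading '-'."""
    requested = set(requested)
    unknown = requested - set(all_values)
    if unknown:
        raise ValueError(f"unknown enum values: {sorted(unknown)}")
    mixed = " ".join(v if v in requested else f"-{v}" for v in all_values)
    none = " ".join(f"-{v}" for v in all_values)
    return f"({mixed}) OR ({none})"


def build_search_body(requested):
    """Wrap the generated string in the same request body as above."""
    return {
        "size": 100,
        "query": {
            "query_string": {
                "default_field": "my_enum",
                "query": build_query_string(requested),
            }
        },
        "sort": [{"_id": {"order": "desc"}}],
    }


print(build_query_string(["Z", "X", "W"]))
# (Z -Y X W -V) OR (-Z -Y -X -W -V)
```

Rejecting unknown values up front keeps arbitrary user input out of the query string syntax.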

Using `bool`

POST /oracle_cards/_search
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_enum": "Z"
          }
        },
        {
          "match": {
            "my_enum": "W"
          }
        },
        {
          "match": {
            "my_enum": "X"
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "exists": {
                  "field": "my_enum"
                }
              }
            ]
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "my_enum": "Y"
          }
        },
        {
          "match": {
            "my_enum": "V"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_id": {
        "order": "desc"
      }
    }
  ]
}
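
The bool version can be generated the same way. This is only a sketch with made-up names (`ALL_VALUES`, `build_bool_query`); it reproduces the structure of the request above and relies on a bool query with no must or filter clauses needing at least one should clause to match.

```python
# The five enum values from the background section.
ALL_VALUES = ["Z", "Y", "X", "W", "V"]


def build_bool_query(requested, all_values=ALL_VALUES, field="my_enum"):
    """Build the bool query: one 'should' match per requested value,
    one 'should' clause for documents without the field, and one
    'must_not' match per value that was not requested."""
    requested = set(requested)
    should = [{"match": {field: v}} for v in all_values if v in requested]
    # Documents with no value at all in the field are also valid results.
    should.append({"bool": {"must_not": [{"exists": {"field": field}}]}})
    must_not = [{"match": {field: v}} for v in all_values if v not in requested]
    return {"bool": {"should": should, "must_not": must_not}}


search_body = {
    "size": 100,
    "query": build_bool_query(["Z", "X", "W"]),
    "sort": [{"_id": {"order": "desc"}}],
}
```

The clause order differs from the hand-written request (Z, X, W instead of Z, W, X), which should not affect which documents match.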

My Question

This query should be as fast as possible, so my question is: will one of these perform better than the other? If so, why? I imagine the query_string will be less performant because it has to parse the query string first, but I'm not sure as I can't find any documentation about it.

DavidTurner · July 16, 2019, 6:17am

The simplest way to answer this question is to go ahead and try the two options with your real data, because the answer depends on so many factors you have left unspecified. I would expect that parsing a short query string like the one above would be a trivial fraction of the total cost of a search. You may like to use the profile API to profile the two searches, and Rally for benchmarking.
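
As a rough illustration of the profiling step: the profile API is switched on by adding `"profile": true` to the search body, and the response then carries a `profile` section with per-shard timings. The Python sketch below uses hypothetical helper names (`with_profiling`, `total_query_nanos`) and a hand-made sample response; it assumes the response layout `profile.shards[].searches[].query[].time_in_nanos`, where each top-level query entry already includes the time of its children, so check that against your Elasticsearch version.

```python
import copy


def with_profiling(search_body):
    """Return a copy of the search body with the profile flag enabled."""
    body = copy.deepcopy(search_body)
    body["profile"] = True
    return body


def total_query_nanos(response):
    """Sum time_in_nanos over the top-level query entries of every shard.
    Assumption: top-level entries are inclusive of their children, so
    children are not added again."""
    total = 0
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                total += query.get("time_in_nanos", 0)
    return total


# Invented sample response, only to show the shape being read.
sample_response = {
    "profile": {
        "shards": [
            {"searches": [{"query": [{"type": "BooleanQuery",
                                      "time_in_nanos": 1200,
                                      "children": [{"time_in_nanos": 400}]}]}]},
            {"searches": [{"query": [{"type": "BooleanQuery",
                                      "time_in_nanos": 800}]}]},
        ]
    }
}

print(total_query_nanos(sample_response))  # 2000
```

Profiling itself adds overhead, so the numbers are better suited to comparing the two queries against each other than to estimating production latency.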

gposcidonio · July 16, 2019, 11:55am

I didn’t know about those tools; they should help quite a bit. I’ll report back with my findings!
