Non-nested query on copy_to fields performs worse then a nested query

Our service uses a lot of nested queries/data, in order to improve our response time I tried to rewrite our heaviest query and use copy_to to copy the fields I need from inside the nested docs to the root doc and query them, expecting a significant performance improvement

Our indexes look like this (simplified and only the relevant fields for the question)

{
  "mappings": {
    "properties": {
      "nested_field_a" : {
          "type" : "nested",
          "properties" : {
            "inner_field" : {
              "type" : "keyword",
              "copy_to" : [
                "_flat_inner_field_a"
              ]
            }
          }
        },
      "_flat_inner_field_a": {
        "type" : "keyword"
      },
      "nested_field_b" : {
          "type" : "nested",
          "properties" : {
            "inner_field" : {
              "type" : "keyword",
              "copy_to" : [
                "_flat_inner_field_b"
              ]
            }
          }
        },
      "_flat_inner_field_b": {
        "type" : "keyword"
      }
    }
  }
}

Our query queries 6 different indices at once, totaling ~11 billion docs, each doc with between 1-5 (usually only 1), nested docs across the two nested fields (some only have docs in one of the nested fields)

The query tries to find all docs that contain at least one of a bunch of terms, it is guaranteed to find at least 1 doc for each term (can reach tens of docs per term as well)

Our existing query looks like this

"query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "nested_field_a",
            "query": {
              "terms": {
                "nested_field_a.inner_field": [
                  1,
                  2
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "nested_field_b",
            "query": {
              "terms": {
                "nested_field_b.inner_field": [
                  "a",
                  "c"
                ]
              }
            }
          }
        }
      ]
    }
  }

(1,2, a,c are just examples, the lists contain unique values that change with each request, with 1-20 items each)

And the new one like this:

"query": {
    "bool": {
      "should": [
        {
          "terms": {
            "_flat_inner_field_a": [
              1,
              2
            ]
          }
        },
        {
          "terms": {
            "_flat_inner_field_b": [
              "a",
              "c"
            ]
          }
        }
      ]
    }
  }

With the new query, not only do I not see the speed improvement I expected, the query performs up to 40% slower!

Profile for the old query (full profile) :

"type" : "BooleanQuery",
"description" : "ToParentBlockJoinQuery (ConstantScore(nested_field_a.inner_field:1 nested_field_a.inner_field:2)) ToParentBlockJoinQuery (ConstantScore(nested_field_b.inner_field:a nested_field_b.inner_field:c))",
"time" : "697.5micros"

Profile for the new query (full profile) :

"type" : "BooleanQuery",
"description" : "+(ConstantScore(_flat_inner_field_a:1 _flat_inner_field_a:2) ConstantScore(_flat_inner_field_b:a _flat_inner_field_b:c)) #DocValuesFieldExistsQuery [field=_primary_term]",
"time" : "1ms"

The difference I see is that the new query no longer performs the two joins, but adds a DocValuesFieldExistsQuery for some reason

  • Our production env runs on elastic 7.3, the elastic env I used for the examples above is 7.8
  • The example above was performed on a local elastic on my pc, a single index with this doc in it
{
  "nested_field_a": [{"inner_field": 1}, {"inner_field": 2}, {"inner_field": 3}],
  "nested_field_b": [{"inner_field": "a"}, {"inner_field": "c"}, {"inner_field": "k"}]
}

I'd appreciate your help with this, thanks :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.