Significant Performance Degradation with Script Aggregations in 2.2.0

Hi All,

We have recently upgraded from 1.7.3 to 2.2.0 and have seen severe performance degradation for queries with script aggregation fields. A sample query that takes about 5.8 seconds in 1.7.3 takes 86 seconds to complete in 2.2.0. The query includes two sum aggregations, one of them on a script field. It is also worth noting that even with the script aggregation removed, the query is still about 2x-4x slower.

We have turned on the index.queries.cache.everything flag mentioned in a few similar posts here, but that didn't seem to help. We already had doc values turned on in the previous version, which eliminates another frequent suggestion. 2.2.0 is also running on a cluster that is 2x the size of the former one. Here is the hot threads output for one complete call (and the query, in case it's useful):

Would really appreciate any insight!


Wow! That's a lot slower. I've poked a few people internally, hopefully they'll have a chance to take a look tomorrow.

Would it be possible to post an example of your query, just so we have a reference point? Do you know how many documents are matched by the query (e.g. how many docs are being evaluated by the script)?

Edit: Another thing to try, if possible, is 2.1. There was some scripting refactoring that went into 2.2, which could possibly be the culprit. Not sure if it's possible for you to spin up a representative 2.1 node/cluster with the same data, but thought I'd ask.

Thanks for your reply! The query is here: https://www.dropbox.com/s/765su59mzg58lso/query_public.txt?dl=0 . It matches ~200k documents in our system.

Re: trying 2.1: we don't have the bandwidth immediately, but we're open to it if it'll help figure out the issue!

Did this script execute on the exact same set of data as 1.7? I ask because that script will have O(n) runtime based on the number of fields in the document (and the entire agg is O(n) in the number of matching documents). So if the query is executing against a different dataset which has some very large documents, or more matching docs (or both), it'll simply take longer to run.

Perhaps do a check of your docs and make sure there isn't an outlier with thousands of fields?

There may have been something in the 2.2 script isolation refactoring that slowed down the Groovy script, but the script itself has some problems:

  • You should avoid the _source[foo] notation. That fetches the document from disk and then parses the original JSON into memory. If you're matching 200k docs, that's 200k independent disk seeks and fetches, plus the overhead of parsing each one.

    Instead, you should use the doc[foo] notation, which uses the doc values column store. This is optimized for fast access and will likely be much faster than _source (a quick sketch follows this list).

  • Do you really need a HashMap? Since the script executes once per document, you are creating and destroying a new HashMap 200k times. That's a lot of memory allocation, which has non-trivial overhead, on top of the constant cost of hashing each field into the map. Generally, if you need to start doing complicated logic in a script, it's a good sign that this logic should be extracted into the document itself.
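For reference, here is a minimal sketch of a sum aggregation that reads from doc values instead of _source. The index, type, and field names are made up (the actual query only lives in the Dropbox link), and it assumes inline Groovy scripting is enabled on the cluster:

GET /my_index/my_type/_search
{
   "size": 0,
   "aggs": {
      "my_sum": {
         "sum": {
            "script": {
               "inline": "doc['some_numeric_field'].value",
               "lang": "groovy"
            }
         }
      }
   }
}

The equivalent _source-based script would re-fetch and re-parse the JSON of every matching document, which is where most of the cost goes.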

Let me know if you happen to be able to run against 2.1. I'll keep looking on my end too :slight_smile:

Hey Zach, thanks for looking into the script! The data set is the same. As for the script itself, I believe the fields we're accessing are too complex to be accessible via the doc notation. Here's the structure of one of the nested fields:

"field1": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "string"
              },
              "value": {
                "type": "long"
              }
            }
          }

Please let me know if that's not the case. To your second point, the HashMap could be hard to get rid of given what we need to accomplish, but it's something I can look into more deeply.

Finally, we are working towards finding time to test on 2.1 -- will keep you posted. Let us know if you and your team come up with any more insight! Thanks.

Hi Zachary,

I work with Wendy. We spent some more time and got this tested on 2.1. We copied the relevant data from our main production cluster into two different clusters, one running 1.7.5 and the other running 2.1.2 (I did not try 2.0, since you had not mentioned any scripting changes between 2.0 and 2.1). Note that these two new clusters have exactly the same number of documents and are on exactly the same hardware (on Elastic Cloud, actually).

We then ran the same query on both clusters multiple times. On the 1.7 instance, we still see execution times between 4 and 5 seconds. On the 2.1 instance, we are seeing execution times between 30 and 40 seconds.

So I'm afraid the culprit is somewhere between 1.7 and 2.1... Hope this helps. Let us know if you need anything else from our side to get to the bottom of this.

Raj

Great, thanks @Rajagopal_Sathyamurt and @fwen. I'll report this back to the scripting folks and see what they have to say.

So looking at your script again, it appears you just want to sum up the values of the nested fields? You could try adding an include_in_root parameter to your nested field. This makes a copy of the field's value into a field on the root object. If there are multiple nested fields sharing the same name, all their values are flattened/denormalized into a single multi-valued array.

So you could use include_in_root to flatten all the values into one array, then sum that directly. It naturally omits fields that don't exist, so you don't need to do any scripting to set null/empty values to zero.

You can include it in the mapping like this:

"field1": {
  "type": "nested",
  "include_in_root": true,
  "properties": {
    "name": {
      "type": "string"
    },
    "value": {
      "type": "long"
    }
  }
}

Alternatively, I think I'd try to work out an approach using the nested or reverse_nested aggregations instead of scripting. I'm not sure if it would be possible, but they would be much faster.

Based on the documentation here, it looks like the values will all be flattened together, regardless of whether they share a name or not... Am I missing something?

What we would like is to be able to sum all values corresponding to name1, all values corresponding to name2, and so on. Please note that we don't want to store 'name1', 'name2', etc. as fields themselves, because that could lead to a mapping explosion, given that those names come from our customers' data.

We had looked into nested and reverse_nested aggregations as well, but weren't able to apply them to this use case. Again, we aren't sure if we are missing something.

We are open to suggestions!

Sorry, the docs probably are a bit vague. When copied into the root, the values are flattened into a unique field defined by their full dot path. E.g. one field will generate a field on the root called foo.bar.baz: [1,2,3], whereas a second field would generate a field called foo.bar.not_baz: [10,20,30].

Here is a simple demo showing that a plain sum on the flattened root field gives the same result as a nested sum:

PUT nested_test
{
   "mappings": {
      "test": {
         "properties": {
            "foo": {
               "type": "nested",
               "include_in_root": true,
               "properties": {
                  "numbers": {
                     "type": "integer"
                  },
                  "other_numbers": {
                     "type": "integer"
                  }
               }
            }
         }
      }
   }
}

POST /nested_test/test/
{
   "foo": [
      {
         "numbers": 1,
         "other_numbers": 10
      },
      {
         "numbers": 2,
         "other_numbers": 20
      }
   ]
}

GET /nested_test/test/_search
{
   "size": 0,
   "aggs": {
      "root_sum": {
         "sum": {
            "field": "foo.numbers"
         }
      },
      "root_sum_other": {
         "sum": {
            "field": "foo.other_numbers"
         }
      },
      "nested_sum": {
         "nested": {
            "path": "foo"
         },
         "aggs": {
            "sum": {
               "sum": {
                  "field": "foo.numbers"
               }
            }
         }
      },
      "nested_sum_other": {
         "nested": {
            "path": "foo"
         },
         "aggs": {
            "sum": {
               "sum": {
                  "field": "foo.other_numbers"
               }
            }
         }
      }
   }
}

And the output (note that root_sum and nested_sum both equal 3, while root_sum_other and nested_sum_other both equal 30):

{
   "aggregations": {
      "root_sum": {
         "value": 3
      },
      "nested_sum": {
         "doc_count": 2,
         "sum": {
            "value": 3
         }
      },
      "root_sum_other": {
         "value": 30
      },
      "nested_sum_other": {
         "doc_count": 2,
         "sum": {
            "value": 30
         }
      }
   }
}

But I can see how this might not work, since you are storing key:value pairs in the nested docs. So all value fields would be collapsed together.

Yeah, that's fair. It's possible you may be stuck with scripts, unless you can split tenants into their own index (to isolate the mappings).

Could you instead update counts at index-time, so that summing isn't needed? E.g. when updating a document, instead of appending a new nested object, you just update the sum?
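For illustration, a pre-aggregated counter could be maintained with the Update API, roughly like this; the index, type, id, and field names are hypothetical, and it assumes inline scripting is enabled:

POST /my_index/my_type/1/_update
{
   "script": {
      "inline": "ctx._source.running_total += delta",
      "params": {
         "delta": 5
      }
   }
}

Each write then keeps a pre-computed running_total on the document, and the query-time aggregation becomes a plain sum on that field.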

Related, Mark Harwood's "entity-centric indexing" may be the right approach here:

https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes
https://www.elastic.co/videos/entity-centric-indexing-london-meetup-sep-2014

We could do that, but we need to be able to add and delete these 'names' at will, so adding a 'name' would mean updating the sum on every document. Right now, since we have the individual name-value pairs, we can choose at query time which 'names' need to be summed up (while indexing all name-value pairs up front).

Entity-centric indexing sounds interesting. I'll take a deeper look at it.

Hi folks,
Is there any way I can access the results of an aggregation in post-filter scripts? I want to filter out aggregation results that are less than some number.

For example, I want to count the distinct number of documents using the cardinality aggregation and show only results that have cardinality > 3.

Many thanks, Kumar

Hello.

I was interested in this discussion since it deals with script aggregation performance.

I would appreciate it if Rajagopal-san's team could post an update on how they eventually handled the aggregation.

Sounds like you may be looking for Bucket Selector Aggs: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-bucket-selector-aggregation.html
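For example (the field and aggregation names here are made up), a terms aggregation whose buckets are kept only when a cardinality sub-aggregation exceeds 3 could look roughly like this:

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "by_group": {
         "terms": {
            "field": "group_id"
         },
         "aggs": {
            "distinct_docs": {
               "cardinality": {
                  "field": "doc_id"
               }
            },
            "min_distinct_filter": {
               "bucket_selector": {
                  "buckets_path": {
                     "distinctCount": "distinct_docs"
                  },
                  "script": "distinctCount > 3"
               }
            }
         }
      }
   }
}

Note that bucket_selector is a pipeline aggregation: it filters the buckets of a parent multi-bucket aggregation, so it can't be applied to a single top-level metric.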

Update: we eventually figured out a way to use a nested query in our aggregation, avoiding scripting altogether. We are planning to re-index the nested name field as a not_analyzed string so we can use term queries to match the names we want.
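Roughly, the idea is to replace the scripted sum with something along these lines (simplified sketch; name1 stands in for one of the customer-provided names):

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "field1_values": {
         "nested": {
            "path": "field1"
         },
         "aggs": {
            "only_name1": {
               "filter": {
                  "term": {
                     "field1.name": "name1"
                  }
               },
               "aggs": {
                  "name1_sum": {
                     "sum": {
                        "field": "field1.value"
                     }
                  }
               }
            }
         }
      }
   }
}

With field1.name indexed as not_analyzed, the term query matches the stored name exactly and no script is needed.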

Thanks everyone for your help!

Thanks fwen, I managed to use the bucket selector and it works fine.