Groovy vs. Painless performance difference

Hi,

We are using Elasticsearch 5.3.0 and noticed a dramatic performance difference between Groovy and Painless.
The index has 5 shards and about 30M documents, with a simple mapping (about 15 fields, all strings are keyword).

The following query takes over 70 seconds to complete - while it is running I see spikes in both CPU utilization and young GC.

POST test-idx/_search
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Hostname" : {
      "terms" : {
        "script" : {
          "inline" : "(_source.Hostname == null) ? null : _source.Hostname",
          "lang" : "groovy"
        },
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

When switching to painless, the query takes ~4 seconds and none of the aforementioned effects are observed:

POST test-idx/_search
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Hostname" : {
      "terms" : {
        "script" : {
          "inline" : "doc['Hostname'].value == null ? null : doc['Hostname'].value",
          "lang" : "painless"
        },
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

Is there an explanation for this drastic difference in performance? Is there some way we can optimize the groovy script to make it run faster (although it seems very simple to me)?

After doing some reading, I see that the difference is not the language itself but the use of _source, which slows the query down significantly.

I would like to pursue this issue further, as I am trying to find a way to optimize cases where _source is used (I don't always have control over the query).

I've done some benchmarking, and the difference between doc_values and _source is tremendous in some cases.

In the example below, I am querying an alias that has two indices, with a total amount of about 430M documents (approx. 270GB in size, stored in ES).

The following query, using doc values, takes ~38 seconds to execute, and I see moderate load on the data nodes' CPUs while it is running:

{
  "from" : 0,
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "site" : {
      "terms" : {
        "script" : {
          "inline" : "doc['Site'].value == null ? null : doc['Site'].value",
          "lang" : "groovy"
        },
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

In contrast, when using _source, it takes over 11 minutes (!!) to execute the query, and during this time the data nodes' CPUs are almost fully utilized and the cluster is almost completely unresponsive:

{
  "from" : 0,
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "site" : {
      "terms" : {
        "script" : {
          "inline" : "(_source.Site == null) ? null : _source.Site",
          "lang" : "groovy"
        },
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      }
    }
  }
}

Is there anything that can be done to improve the performance when using _source? Would using dedicated client nodes help?

As a group, the Elasticsearch maintainers aren't really big fans of _source in search-time scripts. While it is useful for one-off experimentation, we really don't think you can ever expect it to be fast. Dedicated client nodes aren't going to help here. _source is stored on disk in a way that compresses well but is not good for lots of accesses. It is a great structure for returning a few documents from a search, but doc values are much nicer for search scripts like this.


Thanks for the response Nik. I don't actually expect _source to be fast, in fact there are two things I am interested in:

  1. Is the performance difference I encountered between doc_values and _source reasonable (~17x)?
    Can you shed some light on what's actually going on behind the scenes in Elasticsearch when running this query with _source, to explain the heavy CPU load I am seeing? Looking at hot_threads, there seems to be a lot of JSON parsing going on.

  2. Given all of the limitations of _source, is there anything at all that can be done to boost its performance?

We might be able to boost _source's performance in search scripts but I don't think we have much in the way of appetite for adding any complexity there. There are other options that we recommend, mostly doc values like you've been using.

Under the hood, _source is stored in Lucene "stored fields". These fields are stored in a way that optimizes two things:

  1. Storage space
  2. Fetching all the fields at once

They are stored by taking the stored fields from a few documents, sticking them together into a chunk, and then compressing that chunk. That means that when you load _source, we have to decompress enough of the chunk to read the entire _source for your document. So we may have to decompress other documents that happen to be stored in the same chunk.
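As a rough illustration, here is a toy model in Python (not Lucene's actual chunk format): several documents' sources are serialized together and compressed as one chunk, so reading a single field of a single document still pays for decompressing and parsing the whole chunk.

```python
import json
import zlib

# Toy model: 128 documents' sources concatenated and compressed as one chunk.
docs = [{"Site": f"site-{i}", "Hostname": f"host-{i}"} for i in range(128)]
chunk = zlib.compress(json.dumps(docs).encode("utf-8"))

def read_field_from_source(chunk: bytes, doc_index: int, field: str):
    # To get one field of one document, we decompress the entire chunk and
    # parse the JSON -- the work scales with the chunk, not with the field.
    all_docs = json.loads(zlib.decompress(chunk))
    return all_docs[doc_index].get(field)

print(read_field_from_source(chunk, 5, "Site"))  # prints site-5
```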

Then we have to deserialize the _source from whatever format it is stored in, converting it into a Java Map to pass to the script. All to get the one field.

Theoretically there is a lot we could do to make _source faster in search scripts but doc values are already stored in a much more sensible way for this kind of thing. They are stored column-wise so it is much faster to get the value for a single document. So even if we worked hard to save time on deserialization we couldn't really beat the whole chunking problem.
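As a toy illustration in Python (again, not the actual on-disk encoding), a column-oriented layout keeps each field as its own array in doc-id order, so fetching one document's value is a single indexed lookup:

```python
# Toy model of a columnar (doc values-like) layout: one array per field,
# ordered by internal doc id.
site_column = [f"site-{i}" for i in range(128)]

def read_field_from_doc_values(column, doc_index):
    # One indexed read -- no chunk decompression, no JSON parsing.
    return column[doc_index]

print(read_field_from_doc_values(site_column, 5))  # prints site-5
```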

And yes, 17x performance difference is totally reasonable. I only use _source in search scripts where I don't care how long they take.


Thanks Nik, this clarifies a lot 🙂
The only issue I have with doc_values is that they don't work for analyzed fields. When writing a script, you can't really know in advance whether a certain field is analyzed, so _source is the only way to make sure the script will be able to retrieve the field value, no matter the type.
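(When I do control the mapping, a multi-field sidesteps this: indexing the analyzed field with a keyword sub-field means a script can always read doc values through the sub-field. A sketch, where the index, type, and sub-field names are just examples:)

```json
PUT test-idx
{
  "mappings": {
    "my_type": {
      "properties": {
        "Hostname": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

A script can then use doc['Hostname.raw'].value even though the parent field is analyzed.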

Are there any plans to address this limitation?

Assuming a fix would take time (if it is decided that one is required), are there any circuit breakers that can at least protect the cluster from overload?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.