Terms aggregation scripts running slower than expected


(Thomas S.) #1

Hi,

I am exploring the use of scripts with aggregations, and I noticed that
scripts in terms aggregations run much slower than in other aggregations,
even when the script doesn't access any fields. This also happens with
native Java scripts. I'm running Elasticsearch 1.1.0.

For example, on my data set the trivial script "1" takes around 400ms in
the sum and histogram aggregations, but around 25s in a terms aggregation,
even on repeated runs. What is going on here? Terms aggregations without a
script are very fast, and histogram/sum aggregations with scripts that
access the document are also very fast. As a workaround, I had to turn a
script aggregation that should have been a terms aggregation into a
histogram and convert the numeric values back into terms on the client so
that it would run in reasonable time.

In [2]: app.search.search({'size': 0, 'query': {'match_all': {}},
    'aggregations': {'test_script': {'terms': {'script': '1'}}}})
Out[2]:
{u'_shards': {u'failed': 0, u'successful': 246, u'total': 246},
 u'aggregations': {u'test_script': {u'buckets': [{u'doc_count': 4231327,
                                                  u'key': u'1'}]}},
 u'hits': {u'hits': [], u'max_score': 0.0, u'total': 4231327},
 u'timed_out': False,
 u'took': 24986}

In [10]: app.search.search({'size': 0, 'query': {'match_all': {}},
    'aggregations': {'test_script': {'sum': {'script': '1'}}}})
Out[10]:
{u'_shards': {u'failed': 0, u'successful': 246, u'total': 246},
 u'aggregations': {u'test_script': {u'value': 4231327.0}},
 u'hits': {u'hits': [], u'max_score': 0.0, u'total': 4231327},
 u'timed_out': False,
 u'took': 363}

In [8]: app.search.search({'size': 0, 'query': {'match_all': {}},
    'aggregations': {'test_script': {'histogram': {'script': '1',
                                                   'interval': 1}}}})
Out[8]:
{u'_shards': {u'failed': 0, u'successful': 246, u'total': 246},
 u'aggregations': {u'test_script': {u'buckets': [{u'doc_count': 4231327,
                                                  u'key': 1}]}},
 u'hits': {u'hits': [], u'max_score': 0.0, u'total': 4231327},
 u'timed_out': False,
 u'took': 421}
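
For readers who want to try the histogram workaround described above, the client-side conversion might look roughly like the sketch below. The CODE_TO_TERM table and the histogram_to_terms helper are hypothetical names made up for illustration; the post does not say how the numeric values were mapped back to terms, so this is only one plausible way to do it.

```python
# Hypothetical lookup table: numeric code emitted by the script -> term.
# The real mapping depends on the data set and is an assumption here.
CODE_TO_TERM = {1: "red", 2: "green", 3: "blue"}


def histogram_to_terms(buckets, code_to_term):
    """Turn histogram buckets (numeric keys) into terms-style buckets."""
    terms = []
    for bucket in buckets:
        key = int(bucket["key"])
        terms.append({
            # Fall back to the stringified number for unknown codes.
            "key": code_to_term.get(key, str(key)),
            "doc_count": bucket["doc_count"],
        })
    # A terms aggregation orders buckets by descending doc_count by default,
    # so mimic that ordering on the client.
    terms.sort(key=lambda b: b["doc_count"], reverse=True)
    return terms


example = histogram_to_terms(
    [{"key": 1, "doc_count": 5}, {"key": 2, "doc_count": 9}],
    CODE_TO_TERM,
)
```

The input is the 'buckets' list of a histogram aggregation with 'interval': 1, as in the Out[8] response above.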

Thomas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4af8942c-db46-47fa-9d38-370051a15c5c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

The terms aggregation relies on field data producing unique values per
document in order to run efficiently. When you provide a script, by default
a wrapper deduplicates the script's values to make sure the result is the
same as if the data were stored in the index.
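
To illustrate why deduplication matters, here is a toy model in Python. It is not Elasticsearch's actual implementation; count_terms and its deduplicate flag are invented for this sketch. The point is that a document should add at most one to a bucket's doc_count, even if its script emits the same value several times.

```python
from collections import Counter


def count_terms(docs_values, deduplicate=True):
    """Toy model of terms bucketing: count documents per value.

    docs_values is a list with one entry per document, each entry being
    the list of values that document produced.
    """
    counts = Counter()
    for values in docs_values:
        if deduplicate:
            # Extra per-document work: collapse repeated values so the
            # document is counted once per bucket.
            values = set(values)
        counts.update(values)
    return counts


docs = [["a", "a", "b"], ["b"]]
with_dedup = count_terms(docs)                        # "a" counted once
without_dedup = count_terms(docs, deduplicate=False)  # "a" counted twice
```

Skipping the deduplication step is only safe when each document's values are already unique, which is always the case for a script returning a single value.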

You can tell Elasticsearch to assume that values are already unique by
passing script_values_unique: true to the terms aggregation. Can you
check if it makes the aggregation faster?
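
As a sketch, the request body from the original post with this option added could be built as follows. The terms_script_agg helper is a hypothetical name used only for illustration; the script_values_unique key is the option mentioned above, and the rest of the body is copied from the original query.

```python
def terms_script_agg(name, script, values_unique=True):
    """Build a search body with a scripted terms aggregation."""
    terms = {"script": script}
    if values_unique:
        # Tell Elasticsearch the script never emits duplicate values for
        # a document, so the deduplicating wrapper can be skipped.
        terms["script_values_unique"] = True
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggregations": {name: {"terms": terms}},
    }


body = terms_script_agg("test_script", "1")
```

The resulting dict can be passed to the same client call used in the original post, for example app.search.search(body), to compare the 'took' value with and without the option.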

(In reply to Thomas S., Wed, Apr 9, 2014 at 9:36 PM)

--
Adrien Grand



(Guillermo Arias del Río) #3

Thanks! That is an even better solution. I ran some tests and it works.
The buckets, and their order, are almost always the same.


