Aggregation-"sql like" optimization guidance with elasticsearch 1.0.0


(Maxime Nay) #1

Hi,

We are experimenting elasticsearch 1.0.0, and are particularly excited
about the new aggregation feature.

Here is one of our use-case that we would like to optimize :

Right now, to imitate a basic SQL group by query that would look like :
SELECT day, hour, id, SUM(views), SUM(clicks), SUM(video_plays) FROM events
GROUP BY day, hour, id

we are issuing this kind of queries :

{
"size" : 0,
"query":{"match_all":{}},
"aggs" : {
"test_aggregation" : {
"terms" : {
"script" : "doc['day'].date + '-' + doc['hour'].value + '-'

  • doc['id'].value",
    "order" : { "_term" : "asc" },
    "size":
    },
    "aggs" : {
    "sum_click" : { "sum" : { "field" : "clicks" } },
    "sum_views" : { "sum" : { "field" : "views" } },
    "sum_video_plays" : { "sum" : { "field" : "video_plays" } }
    }
    }
    }
    }

But the perfs for this kind of queries are kind of low. Thus, we would like
to know if there are a more optimized way to get what we want.

Thanks !
Maxime

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4fc6e81a-6cc2-4050-84f5-4f82b69e9764%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Maxime, your bottleneck is likely in the script part. It has to dynamically
compute that per doc just like in sql. However, if you can precompute that
at index time (for example, introduce a field that contains the value of
date-hour-id, you should be able to improve that aggregation time
significantly. I did a quick test in 1.0 RC1 with an index of about 100K
docs, and if I precompute that term field (and eliminate the script part),
it is at least 10x faster than the script version. YMMV.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b3b87708-4435-40bb-9182-1f2a843f31c7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Maxime Nay) #3

Unfortunately, we have about 8 different fields that could serve as
aggregation key, and a lot of potential combinations between these fields.
Thus, pre-computing all these combinations doesn't seem to be a viable
solution.

On Friday, January 31, 2014 7:52:40 AM UTC-8, Binh Ly wrote:

Maxime, your bottleneck is likely in the script part. It has to
dynamically compute that per doc just like in sql. However, if you can
precompute that at index time (for example, introduce a field that contains
the value of date-hour-id, you should be able to improve that aggregation
time significantly. I did a quick test in 1.0 RC1 with an index of about
100K docs, and if I precompute that term field (and eliminate the script
part), it is at least 10x faster than the script version. YMMV.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/134a71d9-7683-4804-9ae9-449d40580b35%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #4

Maxime, forgot to mention, you can also distribute the load out by
increasing the shard count and adding more nodes. But precomputing the
field is probably the quickest way to improve that performance. Keep in
mind that unlike SQL, ES aggregations may return approximate metrics if you
have more than 1 shard.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2ee82d59-f8c7-41cc-b777-3af6e18f6200%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Maxime Nay) #5

For test purposes we currently have an index containing about 50M docs,
distributed on a 4 nodes cluster, with 16 shards.
Do you think that drastically increasing the number of shards would help ?

On Friday, January 31, 2014 10:14:08 AM UTC-8, Binh Ly wrote:

Maxime, forgot to mention, you can also distribute the load out by
increasing the shard count and adding more nodes. But precomputing the
field is probably the quickest way to improve that performance. Keep in
mind that unlike SQL, ES aggregations may return approximate metrics if you
have more than 1 shard.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/21874399-99c9-4c6b-8c76-f856ff95216f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Niko Nyrhila) #6

Hi,

You can nest aggregations, so in this case you'd first use Date Histogram
aggregation with an interval of one hour:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

Then you'd aggregate by "id" field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

Here is an example:

This should be very fast, even when running on a single machine.

On Friday, January 31, 2014 3:36:20 AM UTC+2, Maxime Nay wrote:

Hi,

We are experimenting elasticsearch 1.0.0, and are particularly excited
about the new aggregation feature.

Here is one of our use-case that we would like to optimize :

Right now, to imitate a basic SQL group by query that would look like :
SELECT day, hour, id, SUM(views), SUM(clicks), SUM(video_plays) FROM
events GROUP BY day, hour, id

we are issuing this kind of queries :

{
"size" : 0,
"query":{"match_all":{}},
"aggs" : {
"test_aggregation" : {
"terms" : {
"script" : "doc['day'].date + '-' + doc['hour'].value +
'-' + doc['id'].value",
"order" : { "_term" : "asc" },
"size":
},
"aggs" : {
"sum_click" : { "sum" : { "field" : "clicks" } },
"sum_views" : { "sum" : { "field" : "views" } },
"sum_video_plays" : { "sum" : { "field" : "video_plays" } }
}
}
}
}

But the perfs for this kind of queries are kind of low. Thus, we would
like to know if there are a more optimized way to get what we want.

Thanks !
Maxime

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bb2293a1-b83c-45a1-af42-e48b3fd9a0c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(system) #7