We are experimenting elasticsearch 1.0.0, and are particularly excited
about the new aggregation feature.
Here is one of our use-case that we would like to optimize :
Right now, to imitate a basic SQL group by query that would look like :
SELECT day, hour, id, SUM(views), SUM(clicks), SUM(video_plays) FROM events
GROUP BY day, hour, id
Maxime, your bottleneck is likely in the script part. It has to dynamically
compute that per doc just like in sql. However, if you can precompute that
at index time (for example, introduce a field that contains the value of
date-hour-id, you should be able to improve that aggregation time
significantly. I did a quick test in 1.0 RC1 with an index of about 100K
docs, and if I precompute that term field (and eliminate the script part),
it is at least 10x faster than the script version. YMMV.
Unfortunately, we have about 8 different fields that could serve as
aggregation key, and a lot of potential combinations between these fields.
Thus, pre-computing all these combinations doesn't seem to be a viable
solution.
On Friday, January 31, 2014 7:52:40 AM UTC-8, Binh Ly wrote:
Maxime, your bottleneck is likely in the script part. It has to
dynamically compute that per doc just like in sql. However, if you can
precompute that at index time (for example, introduce a field that contains
the value of date-hour-id, you should be able to improve that aggregation
time significantly. I did a quick test in 1.0 RC1 with an index of about
100K docs, and if I precompute that term field (and eliminate the script
part), it is at least 10x faster than the script version. YMMV.
Maxime, forgot to mention, you can also distribute the load out by
increasing the shard count and adding more nodes. But precomputing the
field is probably the quickest way to improve that performance. Keep in
mind that unlike SQL, ES aggregations may return approximate metrics if you
have more than 1 shard.
For test purposes we currently have an index containing about 50M docs,
distributed on a 4 nodes cluster, with 16 shards.
Do you think that drastically increasing the number of shards would help ?
On Friday, January 31, 2014 10:14:08 AM UTC-8, Binh Ly wrote:
Maxime, forgot to mention, you can also distribute the load out by
increasing the shard count and adding more nodes. But precomputing the
field is probably the quickest way to improve that performance. Keep in
mind that unlike SQL, ES aggregations may return approximate metrics if you
have more than 1 shard.
This should be very fast, even when running on a single machine.
On Friday, January 31, 2014 3:36:20 AM UTC+2, Maxime Nay wrote:
Hi,
We are experimenting elasticsearch 1.0.0, and are particularly excited
about the new aggregation feature.
Here is one of our use-case that we would like to optimize :
Right now, to imitate a basic SQL group by query that would look like :
SELECT day, hour, id, SUM(views), SUM(clicks), SUM(video_plays) FROM
events GROUP BY day, hour, id
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.