Hi,
I am trying to visualize some data and I am using elasticsearch sql to build some visualizations. I am importing data about 346M with 2000000 docs.
This is my slow sql:
select userId, count(*) from tableA where metricTime>0 group by userId order by cnt limit 10
but it cost 1min to get results.
select userId, count(*) from tableA where metricTime>0 group by userId
@letwefly ES-SQL uses a composite aggregation to run any query that needs an aggregation in it, for the simple reason that it has pagination support. But composite does come with a disadvantage as well: it cannot sort on anything else other than its keys.
In your query case, though, the sorting needs to happen on the bucket size. On the server side (ES) this is not possible, so ES-SQL implemented client-side sorting in this PR. Sorting being performed client-side, I guess this is where the slowdown comes from.
Thanks. And is there anything we can do to speed up the query? Or to by-pass the client-side limit? Anyway what can i do if i want to get the Top-N sorting results in a short time(several seconds not minutes)?
There is no "bypassing" the limit, as this is how sorting by COUNT(*), in this specific case, works.
"Top-N" can mean several things, depending on how you sort the results. As I mentioned, the problem here is sorting by COUNT(*) and aggregating (group by userId) at the same time and there is no way around it, unless you change your query.
I think I was not clear in my previous message - this is actually a limitation in Elasticsearch and not in SQL. Elastisearch doesn't know how to sort the aggregation results by their COUNT. SQL just tries to offer the functionality and the only way it can do this is client-side (in memory).
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.