Why does Elasticsearch-SQL “group by” use a composite aggregation?

When I tested the Elasticsearch-SQL feature, I found that “group by” is too slow. Why does “group by” rely on Elasticsearch’s composite aggregation?
Case 1: using Elasticsearch-SQL

GET /_sql?format=json
{
  "query": """
select "rolename",max(emotion) mx from "chat-alarm-1"  where gamename = 'zxyd' and logdate = '2020-05-06' and channel = '4'
group by rolename order by mx desc
limit 100
""",
  "request_timeout": "600s"
}

response:

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

Case 2: using Elasticsearch DSL

GET chat-alarm-1/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "gamename": {
              "value": "zxyd"
            }
          }
        },
        {
          "term": {
            "logdate": {
              "value": "2020-05-06"
            }
          }
        },
        {
          "term": {
            "channel": {
              "value": "4"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "emo": {
      "terms": {
        "field": "rolename.keyword",
        "size": 100,
        "order": {
          "mx": "desc"
        }
      },
      "aggs": {
        "mx": {
          "max": {
            "field": "emotion"
          }
        }
      }
    }
  }
}

response:

{
  "took" : 5586,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {

As you can see from the examples above, the Elasticsearch-SQL aggregation is far too slow: the SQL request times out, while the equivalent DSL query completes in about 5.5 seconds.

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting it. This makes it more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

Sorry, I will update it.

Pay attention to the doc_count_error_upper_bound value in your response: it is -1, meaning the error margin is unknown for the doc counts returned.

Distributed data plays havoc with analytics, so returning accurate results is hard when each shard may have different ideas about which selections to return for the final consideration on the coordinating node. This means the reported doc counts in your results can be low. The shard_size setting can make each shard return a larger selection so that the final pruning can include stats from each shard's also-rans, but that requires experimentation. I suspect the SQL implementation uses composite aggs to exhaustively page through all the values of the top grouping term, using multiple requests, in order to return accurate doc counts.
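As an illustration, here is a sketch of the terms aggregation from Case 2 with shard_size raised (the value 1000 is only an arbitrary starting point for experimentation, and the bool filter is omitted for brevity):

GET chat-alarm-1/_search
{
  "size": 0,
  "aggs": {
    "emo": {
      "terms": {
        "field": "rolename.keyword",
        "size": 100,
        "shard_size": 1000,
        "order": {
          "mx": "desc"
        }
      },
      "aggs": {
        "mx": {
          "max": {
            "field": "emotion"
          }
        }
      }
    }
  }
}

Each shard then contributes its top 1000 candidates instead of the default (size * 1.5 + 10), which shrinks, but does not eliminate, the error margin when ordering by a sub-aggregation such as mx.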

I thought the same thing, but the cost of “returning accurate doc counts” is too expensive.

Always a "pick 2 of 3" conundrum with Fast, Accurate, and Big.

I think additional options for SQL “group by”, such as other aggregations, should be provided for users to choose from. The current “group by” is too slow, sometimes unable to return results at all. That is unacceptable; users simply can't wait that long.

Hello @ITzhangqiang,

Compared to the terms aggregation, composite aggs offer some key desirable properties that are a requirement for any SQL implementation:

  • it allows pagination through the entire result set;
  • it guarantees correctness in case multiple keys are used for grouping and ordering.

Besides guaranteeing correctness, this also offers consistency: what if an application needs to switch from fetching only the top N results to fetching the entire data set, and would thus have to move from an apparently fast execution to a comparatively slower one? So while not every query type requires the above guarantees, the proper solution might not be to offer an alternative translation of SQL to DSL, but rather to optimise the execution time of the DSL.
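For illustration, here is a minimal sketch of the kind of composite aggregation a GROUP BY roughly maps to, reusing the field names from the examples above (the exact request generated by Elasticsearch-SQL may differ):

GET chat-alarm-1/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "gamename": "zxyd" } },
        { "term": { "logdate": "2020-05-06" } },
        { "term": { "channel": "4" } }
      ]
    }
  },
  "aggs": {
    "groupby": {
      "composite": {
        "size": 1000,
        "sources": [
          { "rolename": { "terms": { "field": "rolename.keyword" } } }
        ]
      },
      "aggs": {
        "mx": {
          "max": {
            "field": "emotion"
          }
        }
      }
    }
  }
}

Each response carries an after_key that is passed back as "after" in the next request to page through every group. Since a composite aggregation can only order buckets by its source keys, not by a metric like mx, an ORDER BY mx DESC implies paging through all groups and sorting afterwards, which is presumably a large part of why it is so much slower than the single top-N terms request in Case 2.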

Without further details it's hard to tell whether your terms agg taking 5.5+ seconds to return 100 buckets is appropriate for your data, but it might be worth investigating what takes so long to return the result, whether your sharding strategy is correct, the cluster/index settings, etc.
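One way to start that investigation (a sketch reusing the Case 2 aggregation, with the bool filter omitted) is to enable the search profiler and check where time is spent on each shard:

GET chat-alarm-1/_search
{
  "size": 0,
  "profile": true,
  "aggs": {
    "emo": {
      "terms": {
        "field": "rolename.keyword",
        "size": 100,
        "order": {
          "mx": "desc"
        }
      },
      "aggs": {
        "mx": {
          "max": {
            "field": "emotion"
          }
        }
      }
    }
  }
}

The profile section of the response breaks down query and aggregation timings per shard.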

Thanks for your reply @bogdan.pintea.
My sharding strategy is correct and the cluster/index settings have no problems.
The reason it is so slow is the large amount of data and the high cardinality of the grouped field.
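For reference, a quick sketch of how to measure that cardinality (the cardinality aggregation returns an approximate distinct count):

GET chat-alarm-1/_search
{
  "size": 0,
  "aggs": {
    "rolename_count": {
      "cardinality": {
        "field": "rolename.keyword"
      }
    }
  }
}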

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.