Aggregation on previous aggregation's result

Hi first time poster, probably won't be last =D.

I think Elasticsearch is cool, but I am stuck in trying to translate SQL to DSL and I was wondering if I could get some help.

Nutshell is the following (using random schema, but feature request is similar).

Assuming a table has the following columns:
table_i_am

  • name
  • age
  • country

With the data
{ "name": "john_doe", "age": 25, "country": "moon" }
{ "name": "john_doe", "age": 25, "country": "sun" }
{ "name": "foo_bar", "age": 18, "country": "moon" }
{ "name": "foo_bar", "age": 28, "country": "sun" }

With the SQL ->

select temp.name, count(1) as counter from (select name, age from table_i_am where name in ( 'john_doe', 'foo_bar',...) group by name, age) as temp group by temp.name limit 10

so the derived table in from is to group by name, age for duplicate of name, age, country into name, age and then outer group by with select is to count on the name.

Now I tried out POST /_sql/translate for above query, but whereas the derived table was giving what I desired as a composite agg; the complete query wasn't. I figured it might be due to composite agg is not currently compatible with pipeline aggregations, so I tried to get the derived table DSL and fiddle on my own using scripting agg.

However I hit an issue bucket script aggregation only working on numeric and at this point I wasn't sure if I was doing something wrong or if it isn't supported so decided to reach out to the forum to get pointers.

"aggregations": {
  "by_scripting_group_by": {
    "terms": {
      "script": "doc['name'].value + '_' + doc['age'].value"
    }
  }
}

"buckets" : [
{
  "key" : "john_doe_25",
  "doc_count" : 2
},
{
  "key" : "foo_bar_18",
  "doc_count" : 1
},
{
  "key" : "foo_bar_28",
  "doc_count" : 1
}
]
buckets_path must reference either a number value or a single value numeric metric aggregation, got: [StringTerms] at aggregation [by_scripting_group_by]

Thanks a bunch in advance!

Can I get some pointers?

Most folks on this forum will be more familiar with the Elasticsearch query DSL than with SQL. I for one don't quite understand that SQL statement. Maybe you can describe what the desired output would look like? That may give someone an idea on how to implement that with the DSL.

With ES SQL, subqueries are not quite supported. Simple things might work, but more complex subqueries (like yours) will probably not.

Yeah my bad, I really wasn't clear in above example + motive of this post xD. In nutshell I just wanted to get some feedback from the gurus of the area, since tbh I haven't done such complicated aggregation. Previously did simple one level aggregation, since IMO ES shouldn't be relied upon for complicated levels of aggregation (pls correct me here if wrong o_o).

Anywho, without further ado I tried to translate above SQL to DSL

Assume data

{
"name_field": "john_doe",
"interesting_field": "foo_bar_forever",
"misc_field": "a"
},
{
"name_field": "john_doe",
"interesting_field": "foo_bar_forever",
"misc_field": "b"
},
{
"name_field": "john_doe",
"interesting_field": "hello_world",
"misc_field": "c"
},
{
"name_field": "jane_doe",
"interesting_field": "random",
"misc_field": "d"
}

Composite aggregation

"aggregations" : {
"groupby" : {
"composite" : {
"size" : 1000,
"sources" : [
{
"first_group" : {
"terms" : {
"field" : "name_field",
"missing_bucket" : true,
"order" : "asc"
}
}
},
{
"second_group" : {
"terms" : {
"field" : "interesting_field",
"missing_bucket" : true,
"order" : "asc"
}
}
}
]
}
}
}

Results in ->

{
"key" : {
"name_field" : "john_doe",
"interesting_field" : "foo_bar_forever"
},
"doc_count" : 2
},
{
"key" : {
"name_field" : "john_doe",
"interesting_field" : "hello_world"
},
"doc_count" : 1
},
{
"key" : {
"name_field" : "jane_doe",
"interesting_field" : "random"
},
"doc_count" : 1
}

So first would performs composite aggregation on those 2 fields. Then a follow up aggregation of count on the "name_field" was initially being tinkered with (meaning desired below)

name_field interesting_field_count
john_doe 2 (i.e. foo_bar_forever and hello_world)
jane_doe 1 (i.e. random)

{
"key" : {
"name_field" : "john_doe",
"interesting_field_count" : 2
}
},
{
"key" : {
"name_field" : "jane_doe",
"interesting_field_count" : 1
}
}

I was fiddling in trying to get the 2nd aggregation but found composite agg is not currently compatible with pipeline aggregationshttps://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html#_pipeline_aggregations. Then tried fiddling with scripting and use pipeline aggregation on it but found that bucket script aggregation only working on numerichttps://github.com/elastic/elasticsearch/issues/36642.

Verified that BucketHelpers.java returns a doublehttps://github.com/elastic/elasticsearch/blob/40bcee72be76afed8041dc08f63f51214a1e6d0f/server/src/main/java/org/elasticsearch/search/aggregations/pipeline/BucketHelpers.java#L157.

I wanted to know if above statements of non-support is valid. I mean yeah we can do count on the application code side or change in data model, but wanted 2nd opinion in case I am missing something.

Thanks, this community is awesome (not saying it just to get a better response O_O)!

If I understand you correctly, you are looking to get to that last code snippet? You do not need a pipeline aggregation for that. To get the unique value count of a field, you can use the cardinality aggregation. Something like this would work:

{
  "size": 0,
  "aggregations": {
    "groupby": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "names": {
              "terms": {
                "field": "name_field"
              }
            }
          }
        ]
      },
      "aggs": {
        "interesting_field_count": {
          "cardinality": {
            "field": "interesting_field"
          }
        }
      }
    }
  }
}

It would return:

  "aggregations" : {
    "groupby" : {
      "after_key" : {
        "names" : "john_doe"
      },
      "buckets" : [
        {
          "key" : {
            "names" : "jane_doe"
          },
          "doc_count" : 1,
          "interesting_field_count" : {
            "value" : 1
          }
        },
        {
          "key" : {
            "names" : "john_doe"
          },
          "doc_count" : 3,
          "interesting_field_count" : {
            "value" : 2
          }
        }
      ]
    }
  }

Wow I feel dumb, thanks for the above answer Abdon! I tried to translate SQL -> DSL directly and that was my mistake. Should have just started from DSL upwards. This can be resolved/answered, thanks again!