Max() aggregation with group by and then return all fields

hopeng · March 18, 2019, 7:52am

Hi, I have a collection of articles with different authors, titles and revisions. Can I run a search that returns, for each author+title group, only the articles with the highest revision (the second and third rows in the sample below)?
Author | PublishedDate            | Revision | Title
-------+--------------------------+----------+-------------
James  | 2019-02-04T00:00:00.000Z | 1        | I wonder why
James  | 2019-03-04T00:00:00.000Z | 2        | I wonder why
Parker | 2019-03-04T00:00:00.000Z | 1        | The Endgame

I tried a terms + max aggregation with top_hits, but the hit returned by top_hits is not necessarily the one with the max revision (=2):

{
  "aggs" : {
    "groupByAuthor" : {
      "terms" : {
        "field" : "Author.keyword"
      },
      "aggs": {
        "groupByTitle": {
          "terms" : {
            "field" : "Title.keyword"
          },
          "aggs": {
            "maxRevision": {
              "max": {
                "field": "Revision"
              }
            },
            "top_trades_hits": {
              "top_hits": {
                "size" : 1
              }
            }
          }
        }
      }
    }
  }
}

I'm new to Elasticsearch. Sorry if this question has already been asked somewhere else, but I did some research and couldn't find the answer. Thank you.

mjunaidmuzammil · March 18, 2019, 8:37am

Can you share the response that you are getting? Are you looking at the aggregations key in the response?

Mark_Harwood · March 18, 2019, 8:47am

Hi James.

The max and top_hits are two independent summaries - the max calculation has no influence on the top hits.
You need to use the sort feature in the top_hits aggregation to get the highest revision.
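
As a rough sketch of that suggestion (not a tested query), the request from the question could drop the separate max aggregation and sort inside top_hits instead. The aggregation name latestRevision is made up for illustration; the field names are the ones used in the question.

```json
{
  "size": 0,
  "aggs": {
    "groupByAuthor": {
      "terms": { "field": "Author.keyword" },
      "aggs": {
        "groupByTitle": {
          "terms": { "field": "Title.keyword" },
          "aggs": {
            "latestRevision": {
              "top_hits": {
                "size": 1,
                "sort": [ { "Revision": { "order": "desc" } } ]
              }
            }
          }
        }
      }
    }
  }
}
```

With size 1 this returns a single document per author+title bucket, so it only covers the case where one record per group is wanted.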

hopeng · March 18, 2019, 10:42am

Hi Mark,

Thank you. Using 'sort' with top_hits solves the issue if there's only one result per group.

I have another problem, though: each group can actually have multiple matching records that share the max Revision.

I need to get all the records with the max Revision in each group, so I cannot set a fixed 'size' for top_hits. What should I do?

Mark_Harwood · March 18, 2019, 10:49am

Sounds like you're wanting another level of terms aggregation underneath title which is a grouping for the reviewer. So you should have a hierarchy of

terms - author
    terms - title
        terms - reviewer
            top_hits - size 1, sort by date descending

If this ends up being a lot of data for one request you might want to consider using the composite aggregation and using the after param to break it into multiple requests.
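
A minimal sketch of what such a paged request might look like, assuming a version of Elasticsearch that supports the composite aggregation. The names authorTitleReviewer and latest, the page size of 100, and the values inside "after" are illustrative only. The first request would omit "after"; each follow-up request copies the key of the last bucket returned by the previous page (newer versions also report it as after_key in the response).

```json
{
  "size": 0,
  "aggs": {
    "authorTitleReviewer": {
      "composite": {
        "size": 100,
        "sources": [
          { "author":   { "terms": { "field": "Author.keyword" } } },
          { "title":    { "terms": { "field": "Title.keyword" } } },
          { "reviewer": { "terms": { "field": "Reviewer.keyword" } } }
        ],
        "after": { "author": "James", "title": "I wonder why", "reviewer": "Jack" }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "PublishedDate": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
```

Paging stops once a response comes back with no buckets.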

hopeng · March 18, 2019, 11:33am

Hmm, adding “terms - reviewer” returns records with Revision=1, which is not what I need.

In SQL it would look something like:
select * from articles a where a.Revision = (select max(b.Revision) from articles b where b.Author = a.Author and b.Title = a.Title)

I’m keen to know how composite aggregation and “after” will solve the problem. It doesn’t seem straightforward.

Mark_Harwood · March 18, 2019, 11:45am

Are you sorting in the right direction?
Is revision=1 the highest recorded revision for a given author/title/reviewer or are you saying a revision=2 record is missing for one of these combos?

hopeng · March 18, 2019, 12:22pm

This is the source data

curl -X POST "localhost:9200/lib/_doc/_bulk" -H 'Content-Type: application/json' -d'
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 1, "PublishedDate": "2019-02-04", "Reviewer": "Mary", "ReviewComment": "needs improvement"}
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 2, "PublishedDate": "2019-03-04", "Reviewer": "Jack", "ReviewComment": "awesome!"}
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 2, "PublishedDate": "2019-03-04", "Reviewer": "Jones", "ReviewComment": "not bad"}
{ "index":{} }
{"Author": "Parker",    "Title": "The Endgame",    "Revision": 1, "PublishedDate": "2018-12-04", "Reviewer": "Martin", "ReviewComment": "not good"}
{ "index":{} }
{"Author": "Parker",    "Title": "The Endgame",    "Revision": 2, "PublishedDate": "2018-12-14", "Reviewer": "Henry", "ReviewComment": "passed"}
'

This is the search command

{
	"aggs" : {
		"groupByAuthor" : {
			"terms" : { 
				"field" : "Author.keyword"
			},
			"aggs": {
				"groupByTitle": {
					"terms" : { 
						"field" : "Title.keyword"
					},
					"aggs": {
						"groupByReviewer": {
							"terms": {
								"field": "Reviewer.keyword"
							},
							"aggs": {
								"top_trades_hits": {
									"top_hits": {
										 "sort": [
											{
												"Revision": {
													"order": "desc"
												}
											}
										],
										"size" : 1
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

It basically outputs all 5 records instead of just the 2nd, 3rd and 5th (the records with the max Revision in each Author and Title group).

Mark_Harwood · March 18, 2019, 12:42pm

Ah OK.
I misunderstood the requirement. I assumed you wanted the last comment from each reviewer.
I now see you want comments from all reviewers on the last revision.

How about this:

GET lib/_search
{
  "size": 0,
  "aggs": {
    "groupByAuthor": {
      "terms": {
        "field": "Author.keyword"
      },
      "aggs": {
        "groupByTitle": {
          "terms": {
            "field": "Title.keyword"
          },
          "aggs": {
            "groupByRevision": {
              "terms": {
                "field": "Revision",
                "size": 1,
                "order": {
                  "_key": "desc"
                }
              },
              "aggs": {
                "top_trades_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "Revision": {
                          "order": "desc"
                        }
                      }
                    ],
                    "size": 100
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

hopeng · March 19, 2019, 12:02am

Hi Mark,

Exactly what I needed!
I was evaluating whether ES could do this for a new project; if not, the team would have gone with a relational database. Now ES is more likely to be chosen!

The learning curve for building ES queries is probably the main drawback compared to an RDBMS (hopefully ES SQL support will mature soon), but that is offset by the professionalism and swift responses of the community. Thank you very much!

Regards,
James

hopeng · March 19, 2019, 1:17am

By the way, what does the initial ("size":0) do? It seems to return the same content without it.

Mark_Harwood · March 19, 2019, 9:47am

For search results, Elasticsearch normally returns the top 10 matching documents plus any aggregations (think of your typical e-commerce search results with the top 10 matching products and summaries of options for refining by price/colour/brand).

In your scenario you only want the aggregations and can dispense with the top-matching documents (so, size=0).

Glad to hear you got it working.

system · April 16, 2019, 9:47am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
