Max() aggregation with group by and then return all fields


(James H) #1

Hi, I have a collection of articles with different author, title and revisions. Can I run a search to get all the articles with the biggest revision in their author+title group (records in bold)?
Author | PublishedDate | Revision | Title
---------------+------------------------+---------------+---------------
James |2019-02-04T00:00:00.000Z|1 |I wonder why
James |2019-03-04T00:00:00.000Z|2 |I wonder why
Parker |2019-03-04T00:00:00.000Z|1 |The Endgame

I tried terms+max aggregation with top_hits. The returned top hits record is not really with the max revision (=2):

{
  "aggs" : {
    "groupByAuthor" : {
      "terms" : {
        "field" : "Author.keyword"
      },
      "aggs": {
        "groupByTitle": {
          "terms" : {
            "field" : "Title.keyword"
          },
          "aggs": {
            "maxRevision": {
              "max": {
                "field": "Revision"
              }
            },
            "top_trades_hits": {
              "top_hits": {
                "size" : 1
              }
            }
          }
        }
      }
    }
  }
}

I'm new to elasticsearch. Sorry if this question is already asked somewhere else, but I did some research and couldn't find the answer. Thank you.


(Junaid) #2

Can you share the response that you are getting? Are you looking at the aggregations key in response?


(Mark Harwood) #3

Hi James.

The max and top_hits are two independent summaries - the max calculation has no influence on the top hits.
You need to use the sort feature in the top_hits aggregation to get the highest revision.


(James H) #5

Hi Mark,

Thank you 'sort' with top_hits solves the issue if there's only one result per group.

I have another problem that each group actually have multiple matching records (in bold) like so:

Title | Author | PublishedDate | Revision | Reviewer | ReviewComment
---------------+---------------+------------------------+---------------+---------------+-----------------
The Endgame |Parker |2018-12-04|1 |Martin |not good
The Endgame |Parker |2018-12-14|2 |Henry |passed
I wonder why |James |2019-02-04|1 |Mary |needs improvement
I wonder why |James |2019-03-04|2 |Jack |awesome!
I wonder why |James |2019-03-04|2 |Jones |not bad

I need to get all records with max Revision in each group. So I cannot set a fixed 'size' for top_hits. What should I do?


(Mark Harwood) #6

Sounds like you're wanting another level of terms aggregation underneath title which is a grouping for the reviewer. So you should have a hierarchy of

terms - author
    terms - title
        terms - reviewer
            top_hits - size 1, sort by date descending

If this ends up being a lot of data for one request you might want to consider using the composite aggregation and using the after param to break it into multiple requests.


(James H) #7

hmm Adding “terms - reviewers” returns records with Revision=1 which is not what I need.

If it’s SQL it will look like:
select * from table group by Auhor, Title having Revision=max(Revision)

I’m keen to know how composite aggregation and “after” will solve the problem. It doesn’t seem straightforward.


(Mark Harwood) #8

Are you sorting in the right direction?
Is revision=1 the highest recorded revision for a given author/title/reviewer or are you saying a revision=2 record is missing for one of these combos?


(James H) #9

This is the source data

curl -X POST "localhost:9200/lib/_doc/_bulk" -H 'Content-Type: application/json' -d'
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 1, "PublishedDate": "2019-02-04", "Reviewer": "Mary", "ReviewComment": "needs improvement"}
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 2, "PublishedDate": "2019-03-04", "Reviewer": "Jack", "ReviewComment": "awesome!"}
{ "index":{} }
{"Author": "James",    "Title": "I wonder why",    "Revision": 2, "PublishedDate": "2019-03-04", "Reviewer": "Jones", "ReviewComment": "not bad"}
{ "index":{} }
{"Author": "Parker",    "Title": "The Endgame",    "Revision": 1, "PublishedDate": "2018-12-04", "Reviewer": "Martin", "ReviewComment": "not good"}
{ "index":{} }
{"Author": "Parker",    "Title": "The Endgame",    "Revision": 2, "PublishedDate": "2018-12-14", "Reviewer": "Henry", "ReviewComment": "passed"}
'

This is the search command

{
	"aggs" : {
		"groupByAuthor" : {
			"terms" : { 
				"field" : "Author.keyword"
			},
			"aggs": {
				"groupByTitle": {
					"terms" : { 
						"field" : "Title.keyword"
					},
					"aggs": {
						"groupByReviewer": {
							"terms": {
								"field": "Reviewer.keyword"
							},
							"aggs": {
								"top_trades_hits": {
									"top_hits": {
										 "sort": [
											{
												"Revision": {
													"order": "desc"
												}
											}
										],
										"size" : 1
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

It basically output all 5 records instead of just 2nd, 3rd, 5th records (the records with max Revision group by Author and Title)


(Mark Harwood) #10

Ah OK.
I misunderstood the requirement. I assumed you wanted the last comment from each reviewer.
I now see you want comments from all reviewers on the last revision.

How about this:

GET lib/_search
{
  "size":0,
	"aggs" : {
		"groupByAuthor" : {
			"terms" : { 
				"field" : "Author.keyword"
			},
			"aggs": {
				"groupByTitle": {
					"terms" : { 
						"field" : "Title.keyword"
					},
					"aggs": {
						"groupByRevision": {
							"terms": {
								"field": "Revision",
								"size":1,
								"order": {
								  "_term": "desc"
								}
							},
							"aggs": {
								"top_trades_hits": {
									"top_hits": {
										 "sort": [
											{
												"Revision": {
													"order": "desc"
												}
											}
										],
										"size" : 100
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

(James H) #11

Hi Mark,

Exactly what I needed!
I was evaluating whether ES can do this for the new project, if not the team will go with relational database. But now ES is more likely to be chosen!

The learning curve of building ES queries is probably the 'cons' compared to RDBMS (Hopefully ES SQL support will mature soon). But this is addressed by the professionalism and swift responses of the community. Thank you very much!

Regards,
James


(James H) #12

By the way, what does the initial ("size":0) do? It seems to return the same content without it.


(Mark Harwood) #13

For search results elasticsearch normally returns the top 10 matching documents plus any aggregations (think of your typical e-commerce search results with top 10 matching products and summaries of options for refining by price/colour/brand).

In your scenario you only want the aggregations and can dispense with the top-matching documents (so, size=0).

Glad to hear you got it working


(system) closed #14

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.