Aggregations within aggregations

Karolinebryn · November 20, 2017, 1:54pm

I have an index that contains documents about movies and the movies are tagged with their genre and main actors. Like this:

{
	"title": "The fast and the furious",
	"tags": [
		{
			"groupName": "Genre",
			"tags": [ "Action", "Racing"]
		},
		{
			"groupName": "Actors",
			"tags": ["Vin Diesel", "Paul Walker"]
		}
	]
},
{
	"title": "2 fast 2 furious",
	"tags": [
		{
			"groupName": "Genre",
			"tags": [ "Action"]
		},
		{
			"groupName": "Actors",
			"tags": ["Paul Walker", "Ludacris"]
		}
	]
}

Now I want to use aggregation to aggregate first on groupName then on the tags-list, in order to create a list like this:

Genre

Action (2)
Racing (1)

Actors

Paul Walker (2)
Vin Diesel (1)
Ludacris (1)

I have tried to do the following aggregation:

POST mymovies/_search
 {
   "query": {
     "match_all": {}
   },
   "aggs": {
     "movie_groupName": {
       "terms": {
         "field": "tags.groupName",
         "size": 20
       },
       "aggs": {
         "movie_tags": {
           "terms": {
             "field": "tags.tags",
             "size": 20
           }
         }
       }
     }
   }
 }

This is the result I expected:

"aggregations": {
	"movie_groupName": {
		"doc_count_error_upper_bound": 0,
		"sum_other_doc_count": 0,
		"buckets": [
			{
				"key": "Genre",
				"doc_count": 2,
				"movice_tags": {
					"doc_count_error_upper_bound": 0,
					"sum_other_doc_count": 0,
					"buckets": 
					[
						{
							"key": "Action",
							"doc_count": 2
						},
						{
							"key": "Racing",
							"doc_count": 1
						}
					]
				}
			},
			{
				"key": "Actors",
				"doc_count": 2,
				"movice_tags": {
					"doc_count_error_upper_bound": 0,
					"sum_other_doc_count": 0,
					"buckets": 
					[
						{
							"key": "Paul Walker",
							"doc_count": 2
						},
						{
							"key": "Vin Diesel",
							"doc_count": 1
						},
						{
							"key": "Ludacris",
							"doc_count": 1
						}
					]
				}
			}
		]
	}
}

But this is the result I get back:

"aggregations": {
	"movie_groupName": {
		"doc_count_error_upper_bound": 0,
		"sum_other_doc_count": 0,
		"buckets": [
			{
				"key": "Genre",
				"doc_count": 2,
				"movice_tags": {
					"doc_count_error_upper_bound": 0,
					"sum_other_doc_count": 0,
					"buckets": 
					[
						{
							"key": "Action",
							"doc_count": 2
						},				
						{
							"key": "Paul Walker",
							"doc_count": 2
						},
						{
							"key": "Racing",
							"doc_count": 1
						},
						{
							"key": "Vin Diesel",
							"doc_count": 1
						},
						{
							"key": "Ludacris",
							"doc_count": 1
						}
					]
				}
			},
			{
				"key": "Actors",
				"doc_count": 2,
				"movice_tags": {
					"doc_count_error_upper_bound": 0,
					"sum_other_doc_count": 0,
					"buckets": 
					[
						{
							"key": "Action",
							"doc_count": 2
						},				
						{
							"key": "Paul Walker",
							"doc_count": 2
						},
						{
							"key": "Racing",
							"doc_count": 1
						},
						{
							"key": "Vin Diesel",
							"doc_count": 1
						},
						{
							"key": "Ludacris",
							"doc_count": 1
						}
					]
				}
			}
		]
	}
}

Giving me a list that will look like this (which clearly is not what I want):

Genre

Paul Walker (2)
Action (2)
Racing (1)
Vin Diesel (1)
Ludacris (1)

Actors

Paul Walker (2)
Action (2)
Racing (1)
Vin Diesel (1)
Ludacris (1)

What am I doing wrong?

Mark_Harwood · November 20, 2017, 2:02pm

Life is probably made a lot easier if your actors are listed in a field called "actors" and your genres in a field called "genres". e.g.

{
    "title": "2 fast 2 furious",
    "actors": ["Paul Walker", "Ludacris"],
    "genres": ["Action"]
}

Karolinebryn · November 20, 2017, 2:04pm

Yeah, I know how to do it like that. But that is not my question. I want the list to be generic.

dadoonet · November 20, 2017, 2:06pm

Probably because your field tags is not a nested field.
But I deeply agree with @Mark_Harwood 's comment!

Mark_Harwood · November 20, 2017, 2:10pm

Prepare yourself for pain then

Your queries, aggs and docs will all need to adopt nested syntax to overcome the cross-matching problem [1] inherent to Lucene.
Kibana will be unable to use these fields.
Your index will bloat in size.

[1] Proposal for nested document support in Lucene | PPT

Karolinebryn · November 20, 2017, 2:10pm

@dadoonet How do I make it a nested field?
The problem with @Mark_Harwood s suggestion is that I do not know what the groupNames might be. A list of movies is just an example to simplify my problem. I want to be able to create a list showing the groupnames and number of documents pr tag without having to do any programming every time a groupName is added or removed.

Karolinebryn · November 20, 2017, 2:15pm

Well this doesn't sound very good
I want to group something within a group... Do you have any other suggestions as how to solve this problem?

Mark_Harwood · November 20, 2017, 2:24pm

People use elasticsearch to do a lot of of grouping groups within groups and then subgroups etc

However, normally the meaning of the data is expressed on the left-hand side of "fieldname": "value" pairs. Life is more complicated if the meaning of "value" depends on a neighbouring field/value pair buried in the same object. We then need you to be explicit about which neighbouring item contains the context etc. using nested. Messy.

How many unique "groupnames" do you expect to accumulate over time? You might be able to use dynamic mapping to add new fields automatically but this shouldn't be used for too many unique field names (where "too many" is probably a mapping with thousands of fields).

Karolinebryn · November 20, 2017, 2:40pm

Not that many. Maybe 10-20 unique "groupNames" at max

Mark_Harwood · November 20, 2017, 2:55pm

Totally manageable then.
Just throw JSON docs with the new fields at elasticsearch and by default it will automatically expand the internal mapping definition used to manage these fields. For example, given this previously-unseen content:

{
   "genre" : ["Action", "Science Fiction"],
    "actors" : ["Vin Diesel", "Ludacris"]
}

..elasticsearch will automatcally add to the internal mapping definition as follows:

{
  "test": {
	"mappings": {
	  "doc": {
		"properties": {
		  "actors": {
			"type": "text",
			"fields": {
			  "keyword": {
				"type": "keyword",
				"ignore_above": 256
			  }
			}
		  },
		  "genre": {
			"type": "text",
			"fields": {
			  "keyword": {
				"type": "keyword",
				"ignore_above": 256
			  }
			}
 ...

You can query on the ".text" indexed versions of fields e.g. actors.text so that you would match a search for vin regardless of case but would use the keyword versions of fields in any aggregation e.g. actor.keyword so that when returning popular actors you'd get a single keyword Vin Diesel rather than the words vin and diesel as separate items.
You can nest terms aggregations arbitrarily e.g. break down by genre.keyword and then the most popular actor.keyword values under that.

system · December 18, 2017, 2:55pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Writing aggregate with filtering Elasticsearch	5	4957	October 30, 2019
How to filter movies by title and multiple genres, while aggregating genre counts in Elasticsearch? Elasticsearch	4	16	November 5, 2024
Aggregeations with different keys and values (label and ID) Elasticsearch	3	1508	June 28, 2021
Elasticsearch Nested Aggregation Elasticsearch	1	392	September 3, 2018
Aggregation provides multiplied results Elasticsearch	3	365	August 16, 2019

Aggregations within aggregations

Related topics