Aggregate combinations of nested documents

Andy_Gout · July 15, 2021, 1:52pm

Using Elasticsearch, I would like to aggregate combinations of nested documents.

Take a hypothetical index of movie data with these mappings:

{
	mappings: {
		properties: {
			title: {
				type: 'keyword'
			},
			people: {
				type: 'nested',
				properties: {
					id: {
						type: 'keyword'
					},
					name: {
						type: 'keyword'
					},
					role: {
						type: 'keyword'
					}
				}
			}
		}
	}
}

And these docs:

{
	title: "Goodfellas",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '103', name: "Ray Liotta", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" },
		{ id: '105', name: "Frank Vincent", role: "Actor" }
	]
},
{
	title: "Cape Fear",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '106', name: "Nick Nolte", role: "Actor" },
		{ id: '107', name: "Jessica Lange", role: "Actor" }
	]
},
{
	title: "Casino",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '108', name: "Sharon Stone", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" },
		{ id: '105', name: "Frank Vincent", role: "Actor" }
	]
},
{
	title: "Heat",
	people: [
		{ id: '109', name: "Michael Mann", role: "Director" },
		{ id: '110', name: "Al Pacino", role: "Actor" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '111', name: "Val Kilmer", role: "Actor" }
	]
},
{
	title: "The Irishman",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '110', name: "Al Pacino", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" }
	]
}

Is there a way of aggregating pairs of people without having a specific person as a fixed starting point? E.g.

Martin Scorsese and Robert De Niro: 4
Martin Scorsese and Joe Pesci: 3
Robert De Niro and Joe Pesci: 3
Robert De Niro and Al Pacino: 2
Martin Scorsese and Ray Liotta: 1
…
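
To make the target output concrete, here is a minimal Python sketch that computes these pair counts client-side from the sample documents, outside Elasticsearch. It only illustrates the result I am after and is not a proposed solution; `MOVIES` and `count_pairs` are hypothetical names introduced for the example.

```python
from collections import Counter
from itertools import combinations

# Sample data copied from the documents above (title -> names only).
MOVIES = {
    "Goodfellas": ["Martin Scorsese", "Robert De Niro", "Ray Liotta",
                   "Joe Pesci", "Frank Vincent"],
    "Cape Fear": ["Martin Scorsese", "Robert De Niro", "Nick Nolte",
                  "Jessica Lange"],
    "Casino": ["Martin Scorsese", "Robert De Niro", "Sharon Stone",
               "Joe Pesci", "Frank Vincent"],
    "Heat": ["Michael Mann", "Al Pacino", "Robert De Niro", "Val Kilmer"],
    "The Irishman": ["Martin Scorsese", "Robert De Niro", "Al Pacino",
                     "Joe Pesci"],
}


def count_pairs(movies):
    """Count how many movies each unordered pair of people shares."""
    counts = Counter()
    for people in movies.values():
        # Sorting makes (a, b) and (b, a) collapse into the same key.
        for pair in combinations(sorted(set(people)), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(MOVIES)
for (first, second), n in pair_counts.most_common(5):
    print(f"{first} and {second}: {n}")
```

Note that each pair key is stored in alphabetical order, so Martin Scorsese and Joe Pesci appear under the key `("Joe Pesci", "Martin Scorsese")`.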

I would also like to:

Specify Director-Actor pairs only, e.g.

Martin Scorsese and Robert De Niro: 4
Martin Scorsese and Joe Pesci: 3
Martin Scorsese and Ray Liotta: 1
Martin Scorsese and Nick Nolte: 1
Michael Mann and Robert De Niro: 1
…

Increase the pairs to triples, quadruples, etc., e.g. triples:

Martin Scorsese and Robert De Niro and Joe Pesci: 3
Martin Scorsese and Robert De Niro and Frank Vincent: 2
Martin Scorsese and Robert De Niro and Ray Liotta: 1
Martin Scorsese and Ray Liotta and Joe Pesci: 1
Robert De Niro and Ray Liotta and Frank Vincent: 1
…

Include the derivation of the combinations (which would perhaps require a multi-level aggregation), e.g.

Martin Scorsese and Robert De Niro: 4 (Goodfellas, Cape Fear, Casino, The Irishman)
Martin Scorsese and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
Robert De Niro and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
Robert De Niro and Al Pacino: 2 (Heat, The Irishman)
Martin Scorsese and Ray Liotta: 1 (Goodfellas)
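
The three variations above can be sketched client-side in Python (not as an Elasticsearch aggregation) to pin down exactly what each one should return. The names `MOVIES`, `combos_with_titles` and `director_actor_pairs` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from itertools import combinations

# Sample data copied from the documents above: title -> [(name, role), ...].
MOVIES = {
    "Goodfellas": [("Martin Scorsese", "Director"), ("Robert De Niro", "Actor"),
                   ("Ray Liotta", "Actor"), ("Joe Pesci", "Actor"),
                   ("Frank Vincent", "Actor")],
    "Cape Fear": [("Martin Scorsese", "Director"), ("Robert De Niro", "Actor"),
                  ("Nick Nolte", "Actor"), ("Jessica Lange", "Actor")],
    "Casino": [("Martin Scorsese", "Director"), ("Robert De Niro", "Actor"),
               ("Sharon Stone", "Actor"), ("Joe Pesci", "Actor"),
               ("Frank Vincent", "Actor")],
    "Heat": [("Michael Mann", "Director"), ("Al Pacino", "Actor"),
             ("Robert De Niro", "Actor"), ("Val Kilmer", "Actor")],
    "The Irishman": [("Martin Scorsese", "Director"), ("Robert De Niro", "Actor"),
                     ("Al Pacino", "Actor"), ("Joe Pesci", "Actor")],
}


def combos_with_titles(movies, size):
    """Map each unordered combination of `size` people to the titles they share."""
    result = defaultdict(list)
    for title, people in movies.items():
        names = sorted({name for name, _ in people})
        for combo in combinations(names, size):
            result[combo].append(title)
    return dict(result)


def director_actor_pairs(movies):
    """Map each (director, actor) pair to the titles they share."""
    result = defaultdict(list)
    for title, people in movies.items():
        directors = [name for name, role in people if role == "Director"]
        actors = [name for name, role in people if role == "Actor"]
        for director in directors:
            for actor in actors:
                if director != actor:  # skip someone credited in both roles
                    result[(director, actor)].append(title)
    return dict(result)


triples = combos_with_titles(MOVIES, 3)
da_pairs = director_actor_pairs(MOVIES)

# The count is the number of titles; the list itself is the derivation.
for combo, titles in sorted(triples.items(), key=lambda kv: -len(kv[1]))[:3]:
    print(" and ".join(combo), f": {len(titles)} ({', '.join(titles)})")
```

Passing `size=4` to `combos_with_titles` gives quadruples, and so on. The number of combinations per document grows quickly with `size`, which is part of my concern about doing this at query time.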

The potential solutions I can think of are:

Calculate the pairs before indexing each document and store them in a property that can serve as the term to aggregate on, e.g. a set of compoundId values, which for Goodfellas would be: 101-102, 101-103, 101-104, 101-105, 102-103, 102-104, 102-105, 103-104, 103-105, 104-105 (though some subsequent logic would be needed to look up the names of the people these IDs represent).
Write a Painless script that calculates the pairs at query time. However, given how many combinations of people each document could have, and that this would be repeated across a large amount of data (say ~1m documents), such a query could easily struggle and be impractical for repeated use in a live application.
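
A rough sketch of the first option, in Python. The `compoundIds` field name, the helper functions, and the assumption that the field would be mapped as a top-level `keyword` are all illustrative and not part of the mappings above. The request body uses the standard `terms` aggregation with a `terms` sub-aggregation on `title`; it is only built as a dict here and not sent to a cluster.

```python
from itertools import combinations


def add_compound_ids(doc, size=2):
    """Return a copy of `doc` with a hypothetical `compoundIds` field."""
    # Sort the IDs so a given pair always yields the same key, e.g. "101-104".
    ids = sorted({person["id"] for person in doc["people"]})
    compound = ["-".join(combo) for combo in combinations(ids, size)]
    return {**doc, "compoundIds": compound}


def build_name_lookup(docs):
    """Collect id -> name from the documents, for labelling bucket keys."""
    return {p["id"]: p["name"] for doc in docs for p in doc["people"]}


def label(compound_id, lookup):
    """Turn a bucket key such as "101-104" into a readable label."""
    # Assumes the IDs themselves never contain "-".
    return " and ".join(lookup[i] for i in compound_id.split("-"))


goodfellas = {
    "title": "Goodfellas",
    "people": [
        {"id": "101", "name": "Martin Scorsese", "role": "Director"},
        {"id": "102", "name": "Robert De Niro", "role": "Actor"},
        {"id": "103", "name": "Ray Liotta", "role": "Actor"},
        {"id": "104", "name": "Joe Pesci", "role": "Actor"},
        {"id": "105", "name": "Frank Vincent", "role": "Actor"},
    ],
}

indexed_doc = add_compound_ids(goodfellas)  # what would be sent for indexing
lookup = build_name_lookup([goodfellas])

# Search request body: one bucket per pair, with the contributing titles
# as a sub-aggregation. Assumes `compoundIds` is mapped as `keyword`.
AGG_BODY = {
    "size": 0,
    "aggs": {
        "pairs": {
            "terms": {"field": "compoundIds", "size": 10},
            "aggs": {"titles": {"terms": {"field": "title"}}},
        }
    },
}

print(indexed_doc["compoundIds"])
print(label("101-104", lookup))
```

With this layout, each bucket's `doc_count` would be the number of movies the pair shares, and `label` covers the "subsequent logic" for turning IDs back into names. Director-Actor-only pairs or triples would presumably need their own precomputed fields.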

Ideally I'd like to be able to produce these results using a single Elasticsearch aggregation, although I appreciate this may not be possible.

What solutions are there to this problem?

Thanks in advance.

system · August 12, 2021, 1:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
