Aggregate combinations of nested documents

Using Elasticsearch, I would like to aggregate combinations of nested documents.

Take a hypothetical index of movie data with these mappings:

{
	mappings: {
		properties: {
			title: {
				type: 'keyword'
			},
			people: {
				type: 'nested',
				properties: {
					id: {
						type: 'keyword'
					},
					name: {
						type: 'keyword'
					},
					role: {
						type: 'keyword'
					}
				}
			}
		}
	}
}

And these docs:

{
	title: "Goodfellas",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '103', name: "Ray Liotta", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" },
		{ id: '105', name: "Frank Vincent", role: "Actor" }
	]
},
{
	title: "Cape Fear",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '106', name: "Nick Nolte", role: "Actor" },
		{ id: '107', name: "Jessica Lange", role: "Actor" }
	]
},
{
	title: "Casino",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '108', name: "Sharon Stone", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" },
		{ id: '105', name: "Frank Vincent", role: "Actor" }
	]
},
{
	title: "Heat",
	people: [
		{ id: '109', name: "Michael Mann", role: "Director" },
		{ id: '110', name: "Al Pacino", role: "Actor" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '111', name: "Val Kilmer", role: "Actor" }
	]
},
{
	title: "The Irishman",
	people: [
		{ id: '101', name: "Martin Scorsese", role: "Director" },
		{ id: '102', name: "Robert De Niro", role: "Actor" },
		{ id: '110', name: "Al Pacino", role: "Actor" },
		{ id: '104', name: "Joe Pesci", role: "Actor" }
	]
}

Is there a way of aggregating pairs of people without having a specific person as a fixed starting point? E.g.

  • Martin Scorsese and Robert De Niro: 4
  • Martin Scorsese and Joe Pesci: 3
  • Robert De Niro and Joe Pesci: 3
  • Robert De Niro and Al Pacino: 2
  • Martin Scorsese and Ray Liotta: 1

I would also like to:

Specify Director-Actor pairs only, e.g.

  • Martin Scorsese and Robert De Niro: 4
  • Martin Scorsese and Joe Pesci: 3
  • Martin Scorsese and Ray Liotta: 1
  • Martin Scorsese and Nick Nolte: 1
  • Michael Mann and Robert De Niro: 1

Increase the pairs to triples, quadruples, etc., e.g. triples:

  • Martin Scorsese and Robert De Niro and Joe Pesci: 3
  • Martin Scorsese and Robert De Niro and Frank Vincent: 2
  • Martin Scorsese and Robert De Niro and Ray Liotta: 1
  • Martin Scorsese and Ray Liotta and Joe Pesci: 1
  • Robert De Niro and Ray Liotta and Frank Vincent: 1

Include the derivation of the combinations (which would perhaps require a multi-level aggregation), e.g.

  • Martin Scorsese and Robert De Niro: 4 (Goodfellas, Cape Fear, Casino, The Irishman)
  • Martin Scorsese and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
  • Robert De Niro and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
  • Robert De Niro and Al Pacino: 2 (Heat, The Irishman)
  • Martin Scorsese and Ray Liotta: 1 (Goodfellas)

The potential solutions I can think of are:

  • Calculate the pairs before indexing each document and store them in a property that can serve as the term to aggregate on, e.g. a set of compoundId values that for Goodfellas would be: 101-102, 101-103, 101-104, 101-105, 102-103, 102-104, 102-105, 103-104, 103-105, 104-105 (though some subsequent logic would be needed to look up the names of the people these IDs represent).
  • Write a Painless script that calculates the pairs at query time. Given how many combinations each document can produce, repeated across a large data set (say ~1m documents), such a query could easily be too slow for repeated use in a live application.
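
To make the first option concrete, here is a minimal sketch in plain JavaScript of how the compound IDs might be computed before indexing. The helper names `combinations` and `withCompoundIds` and the `compoundIds` field are made up for illustration and are not part of the mappings above. It also assumes the IDs sort consistently as strings.

```javascript
// Return every k-element combination of `items`, preserving input order.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [first, ...rest] = items;
  // Combinations that include `first`, followed by those that skip it.
  const withFirst = combinations(rest, k - 1).map((combo) => [first, ...combo]);
  return withFirst.concat(combinations(rest, k));
}

// Hypothetical helper: returns a copy of `doc` with a `compoundIds` field
// (an assumed name) holding every k-person combination as "id-id-...".
function withCompoundIds(doc, k = 2) {
  // De-duplicate and sort the IDs (string sort) so that the same group of
  // people always yields the same key, whatever their order in `people`.
  const ids = [...new Set(doc.people.map((p) => p.id))].sort();
  const compoundIds = combinations(ids, k).map((combo) => combo.join('-'));
  return { ...doc, compoundIds };
}

const goodfellas = {
  title: 'Goodfellas',
  people: [
    { id: '101', name: 'Martin Scorsese', role: 'Director' },
    { id: '102', name: 'Robert De Niro', role: 'Actor' },
    { id: '103', name: 'Ray Liotta', role: 'Actor' },
    { id: '104', name: 'Joe Pesci', role: 'Actor' },
    { id: '105', name: 'Frank Vincent', role: 'Actor' }
  ]
};

// Pairs by default; pass 3 for triples, 4 for quadruples, and so on.
console.log(withCompoundIds(goodfellas).compoundIds);
console.log(withCompoundIds(goodfellas, 3).compoundIds);
```

The number of keys per document grows quickly with the cast size and with k, so whether this is workable for ~1m documents is something I have not measured.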

Ideally I'd like to be able to produce these results using a single Elasticsearch aggregation, although I appreciate this may not be possible.
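
For illustration, if each document carried a precomputed top-level keyword field (called `compoundIds` here, an assumed name that is not in the mappings above), a single request along the following lines might produce the counts together with the contributing titles. This is only a sketch of the request body: the function name `buildCombinationAggs` is hypothetical, and I have not checked how it performs at scale.

```javascript
// Build a search body that counts each compound ID and lists its movies.
// `compoundIds` is an assumed precomputed keyword field; `title` is the
// keyword field from the mappings.
function buildCombinationAggs(size = 10) {
  return {
    size: 0, // return aggregation buckets only, no search hits
    aggs: {
      combinations: {
        // One bucket per compound ID; terms buckets are ordered by
        // descending doc_count by default, so the most frequent come first.
        terms: { field: 'compoundIds', size },
        aggs: {
          // Sub-aggregation: the titles of the movies in each bucket.
          titles: { terms: { field: 'title', size: 100 } }
        }
      }
    }
  };
}

console.log(JSON.stringify(buildCombinationAggs(5), null, 2));
```

Each bucket key would still be a string of IDs, so mapping it back to names remains a separate lookup step.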

What solutions are there to this problem?

Thanks in advance.
