Using Elasticsearch, I would like to aggregate combinations of nested documents.
Take a hypothetical index of movie data with these mappings:
{
  mappings: {
    properties: {
      title: {
        type: 'keyword'
      },
      people: {
        type: 'nested',
        properties: {
          id: {
            type: 'keyword'
          },
          name: {
            type: 'keyword'
          },
          role: {
            type: 'keyword'
          }
        }
      }
    }
  }
}
And these docs:
{
  title: "Goodfellas",
  people: [
    { id: '101', name: "Martin Scorsese", role: "Director" },
    { id: '102', name: "Robert De Niro", role: "Actor" },
    { id: '103', name: "Ray Liotta", role: "Actor" },
    { id: '104', name: "Joe Pesci", role: "Actor" },
    { id: '105', name: "Frank Vincent", role: "Actor" }
  ]
},
{
  title: "Cape Fear",
  people: [
    { id: '101', name: "Martin Scorsese", role: "Director" },
    { id: '102', name: "Robert De Niro", role: "Actor" },
    { id: '106', name: "Nick Nolte", role: "Actor" },
    { id: '107', name: "Jessica Lange", role: "Actor" }
  ]
},
{
  title: "Casino",
  people: [
    { id: '101', name: "Martin Scorsese", role: "Director" },
    { id: '102', name: "Robert De Niro", role: "Actor" },
    { id: '108', name: "Sharon Stone", role: "Actor" },
    { id: '104', name: "Joe Pesci", role: "Actor" },
    { id: '105', name: "Frank Vincent", role: "Actor" }
  ]
},
{
  title: "Heat",
  people: [
    { id: '109', name: "Michael Mann", role: "Director" },
    { id: '110', name: "Al Pacino", role: "Actor" },
    { id: '102', name: "Robert De Niro", role: "Actor" },
    { id: '111', name: "Val Kilmer", role: "Actor" }
  ]
},
{
  title: "The Irishman",
  people: [
    { id: '101', name: "Martin Scorsese", role: "Director" },
    { id: '102', name: "Robert De Niro", role: "Actor" },
    { id: '110', name: "Al Pacino", role: "Actor" },
    { id: '104', name: "Joe Pesci", role: "Actor" }
  ]
}
Is there a way of aggregating pairs of people without having a specific person as a fixed starting point? E.g.
- Martin Scorsese and Robert De Niro: 4
- Martin Scorsese and Joe Pesci: 3
- Robert De Niro and Joe Pesci: 3
- Robert De Niro and Al Pacino: 2
- Martin Scorsese and Ray Liotta: 1
- …
I would also like to:
- Specify Director-Actor pairs only, e.g.
  - Martin Scorsese and Robert De Niro: 4
  - Martin Scorsese and Joe Pesci: 3
  - Martin Scorsese and Ray Liotta: 1
  - Martin Scorsese and Nick Nolte: 1
  - Michael Mann and Robert De Niro: 1
  - …
- Increase the pairs to triples, quadruples, etc., e.g. triples:
  - Martin Scorsese and Robert De Niro and Joe Pesci: 3
  - Martin Scorsese and Robert De Niro and Frank Vincent: 2
  - Martin Scorsese and Robert De Niro and Ray Liotta: 1
  - Martin Scorsese and Ray Liotta and Joe Pesci: 1
  - Robert De Niro and Ray Liotta and Frank Vincent: 1
  - …
- Include the derivation of the combinations (which would perhaps require a multi-level aggregation), e.g.
  - Martin Scorsese and Robert De Niro: 4 (Goodfellas, Cape Fear, Casino, The Irishman)
  - Martin Scorsese and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
  - Robert De Niro and Joe Pesci: 3 (Goodfellas, Casino, The Irishman)
  - Robert De Niro and Al Pacino: 2 (Heat, The Irishman)
  - Martin Scorsese and Ray Liotta: 1 (Goodfellas)
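To make the expected output concrete, here is a minimal sketch of the counting I have in mind, written as plain client-side JavaScript rather than as an Elasticsearch aggregation. The helper names (combinations, countCombinations) and the keep predicate are illustrative choices of mine, not part of any library, and the sample data is limited to two of the movies above to keep it short.

```javascript
// Reference computation in plain JavaScript, not an Elasticsearch query.
// All function names here are hypothetical, chosen for illustration only.

// Return every k-element combination of `items`, preserving input order.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [first, ...rest] = items;
  const withFirst = combinations(rest, k - 1).map((combo) => [first, ...combo]);
  return withFirst.concat(combinations(rest, k));
}

// Count the movies shared by each k-person combination, keeping the titles.
// `keep` is an optional predicate for restricting which combinations count.
function countCombinations(movies, k, keep = () => true) {
  const buckets = new Map();
  for (const movie of movies) {
    // Sort by id so the same group of people always yields the same key.
    const people = [...movie.people].sort((a, b) => a.id.localeCompare(b.id));
    for (const combo of combinations(people, k).filter((c) => keep(c))) {
      const key = combo.map((p) => p.id).join('-');
      if (!buckets.has(key)) {
        buckets.set(key, { names: combo.map((p) => p.name), count: 0, titles: [] });
      }
      const bucket = buckets.get(key);
      bucket.count += 1;
      bucket.titles.push(movie.title);
    }
  }
  // Highest counts first.
  return [...buckets.values()].sort((a, b) => b.count - a.count);
}

// Two of the five movies from the docs above.
const sampleMovies = [
  { title: 'Heat', people: [
    { id: '109', name: 'Michael Mann', role: 'Director' },
    { id: '110', name: 'Al Pacino', role: 'Actor' },
    { id: '102', name: 'Robert De Niro', role: 'Actor' },
    { id: '111', name: 'Val Kilmer', role: 'Actor' },
  ] },
  { title: 'The Irishman', people: [
    { id: '101', name: 'Martin Scorsese', role: 'Director' },
    { id: '102', name: 'Robert De Niro', role: 'Actor' },
    { id: '110', name: 'Al Pacino', role: 'Actor' },
    { id: '104', name: 'Joe Pesci', role: 'Actor' },
  ] },
];

// Any pair of people.
const allPairs = countCombinations(sampleMovies, 2);

// Director-Actor pairs only: the pair must contain both roles.
const directorActorPairs = countCombinations(
  sampleMovies,
  2,
  (combo) =>
    combo.some((p) => p.role === 'Director') && combo.some((p) => p.role === 'Actor')
);

// Triples instead of pairs.
const triples = countCombinations(sampleMovies, 3);

console.log(allPairs[0]);
// { names: ['Robert De Niro', 'Al Pacino'], count: 2, titles: ['Heat', 'The Irishman'] }
```

This is only meant to pin down the desired result shape (names, count, titles per combination). Running it over ~1m documents on the client is exactly what I am hoping to avoid.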
The potential solutions I can think of are:
- Calculate the pairs prior to indexing the document and include them as a property that can be used as the term on which to aggregate, e.g. a set of compoundId values that for Goodfellas would be: 101-102, 101-103, 101-104, 101-105, 102-103, 102-104, 102-105, 103-104, 103-105, 104-105 (though there would need to be some subsequent logic to acquire the corresponding names for the people represented by these IDs).
- Write a Painless script that can calculate the pairs at query time. However, given the number of people combinations that each document could have, repeated across a large amount of data (let's say ~1m documents), such a query could easily struggle and not be practical for repeated usage in a live application.
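The first option could be sketched roughly as follows. This assumes a top-level compoundId field mapped as keyword is added to the index (it is not in the mappings above), and the helper names idCombinations and withCompoundIds are hypothetical. The request body uses a standard terms aggregation with a nested terms sub-aggregation on title to recover the derivation.

```javascript
// Sketch of the index-time option. `compoundId` is an assumed extra field
// (type: 'keyword', top level); the helper names are hypothetical.

// Return every k-element combination of `ids`, preserving input order.
function idCombinations(ids, k) {
  if (k === 0) return [[]];
  if (ids.length < k) return [];
  const [first, ...rest] = ids;
  const withFirst = idCombinations(rest, k - 1).map((combo) => [first, ...combo]);
  return withFirst.concat(idCombinations(rest, k));
}

// Return a copy of the movie with its compound IDs attached, ready to index.
function withCompoundIds(movie, k) {
  // Sort ids so a given group of people always yields the same compound ID.
  const ids = movie.people.map((p) => p.id).sort();
  return {
    ...movie,
    compoundId: idCombinations(ids, k).map((combo) => combo.join('-')),
  };
}

const goodfellas = {
  title: 'Goodfellas',
  people: [
    { id: '101', name: 'Martin Scorsese', role: 'Director' },
    { id: '102', name: 'Robert De Niro', role: 'Actor' },
    { id: '103', name: 'Ray Liotta', role: 'Actor' },
    { id: '104', name: 'Joe Pesci', role: 'Actor' },
    { id: '105', name: 'Frank Vincent', role: 'Actor' },
  ],
};

// Document body to index: the original fields plus ten pair IDs.
const goodfellasToIndex = withCompoundIds(goodfellas, 2);

// Search request body: bucket by compound ID, then list titles per bucket.
const searchBody = {
  size: 0,
  aggs: {
    pairs: {
      terms: { field: 'compoundId', size: 10 },
      aggs: {
        titles: { terms: { field: 'title', size: 10 } },
      },
    },
  },
};

console.log(goodfellasToIndex.compoundId);
```

The bucket keys returned would still be ID strings such as 101-102, so the names would have to be resolved separately. Director-Actor pairs or triples would presumably each need their own precomputed field, which multiplies the index size.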
Ideally I'd like to be able to produce these results using a single Elasticsearch aggregation, although I appreciate this may not be possible.
What solutions are there to this problem?
Thanks in advance.