Aggregation based on Array Intersection

Charles_Lariviere · October 14, 2020, 3:48pm

Hi,

I'm looking to write a query that will return the "most frequently co-occurring terms given an input of one or more terms", but I'm unsure how to structure such a query, or whether it is even possible in Elasticsearch.

As an example, given the following documents:

{
    "id": "1",
    "tags": ["planet", "earth", "drawing", "illustration"]
},
{
    "id": "2",
    "tags": ["planet", "saturn", "drawing"]
},
{
    "id": "3",
    "tags": ["planet", "earth", "drawing", "illustration"]
},
{
    "id": "4",
    "tags": ["drawing"]
}

And, given the following input:

["planet"]

I would like to return the terms that occur the most often on documents that have the tag planet as well as some statistics, such that the output would look like:

[
  {"drawing": {"cooccurrence": 3, "total_doc_count": 4}},
  {"earth": {"cooccurrence": 2, "total_doc_count": 2}},
  ...
]

So far, I believe this is totally possible within the Term Vectors API. However, I would like to do the same on array intersection, such that given the input:

["planet", "earth"]

I would like to return the same term frequencies, but only for documents that contain both planet and earth (i.e. in this example, the term vectors would be filtered to documents 1 and 3 only):

[
  {"drawing": {"cooccurrence": 2, "total_doc_count": 4}},
  {"illustration": {"cooccurrence": 2, "total_doc_count": 2}},
  ...
]
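To pin down the expected numbers, here is a minimal plain-Python sketch of the computation described above. It does not touch Elasticsearch at all; the function name `cooccurring_terms` and the tie-breaking by tag name are assumptions made for illustration only.

```python
from collections import Counter

# The four sample documents from this post.
DOCS = [
    {"id": "1", "tags": ["planet", "earth", "drawing", "illustration"]},
    {"id": "2", "tags": ["planet", "saturn", "drawing"]},
    {"id": "3", "tags": ["planet", "earth", "drawing", "illustration"]},
    {"id": "4", "tags": ["drawing"]},
]


def cooccurring_terms(docs, input_tags):
    """Rank tags by how often they appear on documents holding ALL input tags."""
    wanted = set(input_tags)
    # Number of documents each tag appears on, across the whole collection.
    total = Counter(tag for doc in docs for tag in set(doc["tags"]))
    # Array intersection: keep only documents containing every input tag.
    matching = [doc for doc in docs if wanted <= set(doc["tags"])]
    # Count the other tags on those documents (input tags are excluded).
    co = Counter(tag for doc in matching for tag in set(doc["tags"]) - wanted)
    # Highest co-occurrence first; ties broken alphabetically (an assumption).
    ranked = sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {tag: {"cooccurrence": n, "total_doc_count": total[tag]}}
        for tag, n in ranked
    ]


print(cooccurring_terms(DOCS, ["planet"])[:2])
print(cooccurring_terms(DOCS, ["planet", "earth"]))
```

With the sample documents, the first call starts with drawing (3 of 4) and earth (2 of 2), and the second call returns drawing (2 of 4) and illustration (2 of 2), matching the two outputs above.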

Is this possible to achieve within Elasticsearch?

Thanks!
Charles

Mark_Harwood · October 14, 2020, 4:07pm

Check out the significant_terms aggregation. It will help you discover interesting related terms.
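As a rough sketch of what such a request might look like for the example in the question, the Python below only builds the search body as a dict and sends nothing. It assumes `tags` is mapped as a `keyword` field (with default dynamic mapping it would more likely be `tags.keyword`); the helper name `build_significant_terms_request` and the aggregation name `related_tags` are made up for illustration.

```python
import json


def build_significant_terms_request(input_tags, field="tags", size=10):
    """Build a search body: filter on ALL input tags, then aggregate related tags."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "bool": {
                # One term filter per input tag, so a document must contain all of them.
                "filter": [{"term": {field: tag}} for tag in input_tags]
            }
        },
        "aggs": {
            "related_tags": {
                "significant_terms": {
                    "field": field,
                    "size": size,
                    # Lowered from the default so a tiny sample index returns buckets.
                    "min_doc_count": 1,
                    # Leave the input tags themselves out of the results.
                    "exclude": list(input_tags),
                }
            }
        },
    }


body = build_significant_terms_request(["planet", "earth"])
print(json.dumps(body, indent=2))
```

In the response, each bucket's `doc_count` should correspond to the co-occurrence count within the filtered documents and `bg_count` to the count in the background set. Note that significant_terms ranks by a significance score rather than by raw frequency; if plain frequency is what matters, swapping in a `terms` aggregation would order buckets by `doc_count` instead, at the cost of losing the background count.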

Once you have a collection of interesting terms, you can issue a follow-up request using the adjacency matrix aggregation to fill in all sorts of details about co-occurrences. Child aggregations can be used to discover, for example:

When a pair were first used together
How many times they've been used together, per day, over time
How many different people have used these together
etc
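A hedged sketch of that follow-up request is below. Again it only builds the body as a Python dict. The field names `timestamp` and `user` are hypothetical (the example documents in the question have neither), as are the helper name and the aggregation names.

```python
import json


def build_adjacency_request(terms, field="tags",
                            time_field="timestamp", user_field="user"):
    """Build an adjacency_matrix body with one named filter per term.

    `time_field` and `user_field` are assumed fields, not part of the
    sample documents in this thread.
    """
    return {
        "size": 0,
        "aggs": {
            "cooccurrences": {
                "adjacency_matrix": {
                    # Each named filter becomes a row/column of the matrix.
                    "filters": {t: {"term": {field: t}} for t in terms}
                },
                # Child aggregations are computed per bucket, including pair buckets.
                "aggs": {
                    "first_used": {"min": {"field": time_field}},
                    "per_day": {
                        "date_histogram": {
                            "field": time_field,
                            "calendar_interval": "day",
                        }
                    },
                    "distinct_users": {"cardinality": {"field": user_field}},
                },
            }
        },
    }


body = build_adjacency_request(["drawing", "earth", "illustration"])
print(json.dumps(body, indent=2))
```

Buckets for pairs are keyed by joining the two filter names with the separator, `&` by default (for example `drawing&earth`), so the filter names themselves should not contain that character.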

Mark_Harwood · October 14, 2020, 4:53pm

As an example - these are some of the significant terms from Wikipedia articles containing the text "Planet earth" and the adjacency matrix helps us see that these terms form discrete clusters representing the different potential meanings:

[Image: Kibana screenshot]

