Aggregation based on Array Intersection


I'd like to write a query that returns the "most frequently co-occurring terms given an input of one or more terms", but I'm unsure how to structure such a query (or whether it is even possible in Elasticsearch).

As an example, given the following documents:

    {"id": "1", "tags": ["planet", "earth", "drawing", "illustration"]}
    {"id": "2", "tags": ["planet", "saturn", "drawing"]}
    {"id": "3", "tags": ["planet", "earth", "drawing", "illustration"]}
    {"id": "4", "tags": ["drawing"]}

And, given the following input:

["planet"]


I would like to return the terms that occur the most often on documents that have the tag planet as well as some statistics, such that the output would look like:

  {"drawing": {"cooccurrence": 3, "total_doc_count": 4}},
  {"earth": {"cooccurrence": 2, "total_doc_count": 2}},

So far, I believe this is possible with the Term Vectors API. However, I would also like to do the same based on array intersection, such that given the input:

["planet", "earth"]

I would like to return the same term frequencies, but only for documents whose tags include both planet and earth (i.e. in this example, the term vectors would be filtered to documents 1 and 3 only):

  {"drawing": {"cooccurrence": 2, "total_doc_count": 4}},
  {"illustration": {"cooccurrence": 2, "total_doc_count": 2}},
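
To pin down the semantics I'm after, here is a minimal plain-Python sketch of the computation. It is not an Elasticsearch query; the function name `cooccurring_terms` is just illustrative, and excluding the input tags from the result is an assumption on my part based on the expected output above.

```python
from collections import Counter

# Sample documents copied from the example above.
DOCS = [
    {"id": "1", "tags": ["planet", "earth", "drawing", "illustration"]},
    {"id": "2", "tags": ["planet", "saturn", "drawing"]},
    {"id": "3", "tags": ["planet", "earth", "drawing", "illustration"]},
    {"id": "4", "tags": ["drawing"]},
]


def cooccurring_terms(docs, input_tags):
    """Count tags on the documents that contain every tag in input_tags."""
    wanted = set(input_tags)

    # Number of documents each tag appears on, across the whole collection.
    total = Counter(tag for doc in docs for tag in set(doc["tags"]))

    # Array intersection: keep only documents holding all of the input tags.
    matching = [doc for doc in docs if wanted <= set(doc["tags"])]

    # Count the remaining tags on those documents (input tags are excluded).
    co = Counter(tag for doc in matching for tag in set(doc["tags"]) - wanted)

    # Highest co-occurrence first, ties broken alphabetically.
    ranked = sorted(co.items(), key=lambda item: (-item[1], item[0]))
    return {
        tag: {"cooccurrence": count, "total_doc_count": total[tag]}
        for tag, count in ranked
    }
```

Calling `cooccurring_terms(DOCS, ["planet", "earth"])` gives drawing (2 of 4) and illustration (2 of 2), which is the second output shown above.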

Is this possible to achieve within Elasticsearch?


Check out the significant_terms aggregation. It will help you discover interesting related terms.
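
As a rough sketch of what such a request could look like, the Python below builds a search body that filters to documents containing every input tag and runs significant_terms over the same field. It assumes `tags` is mapped as a keyword field; the aggregation name `related` and both helper functions are hypothetical. Each bucket's `doc_count` and `bg_count` roughly correspond to the "cooccurrence" and "total_doc_count" figures in the question.

```python
def build_cooccurrence_query(input_tags, field="tags", size=10):
    """Build a search body: filter on all input tags, then aggregate."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                # One term filter per tag, so a document must contain all of them.
                "filter": [{"term": {field: tag}} for tag in input_tags]
            }
        },
        "aggs": {
            "related": {
                "significant_terms": {
                    "field": field,
                    "size": size,
                    # Lowered so that a tiny sample index still yields buckets.
                    "min_doc_count": 1,
                    # Leave the input tags themselves out of the buckets.
                    "exclude": list(input_tags),
                }
            }
        },
    }


def reshape_buckets(response, agg_name="related"):
    """Convert significant_terms buckets to the shape used in the question."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [
        {
            bucket["key"]: {
                "cooccurrence": bucket["doc_count"],   # docs matching the filter
                "total_doc_count": bucket["bg_count"],  # docs in the background set
            }
        }
        for bucket in buckets
    ]
```

Note that significant_terms ranks by how unusual a term is in the filtered set compared with the whole index, not by raw count, so a tag that appears on nearly every document (such as drawing in the sample data) may rank low or be left out. If ranking purely by count is what you need, a plain terms aggregation under the same filter may be the closer fit.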

Once you have a collection of interesting terms, you can issue a follow-up request using the adjacency_matrix aggregation to fill in all sorts of details about co-occurrences. Child aggregations can be used to discover, for example:

  1. When a pair was first used together
  2. How many times they've been used together, per day, over time
  3. How many different people have used these together
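
The three child aggregations listed above could be sketched as follows. This is only an illustration: the field names `created_at` and `user_id` are hypothetical (the sample documents in the question have neither), as are the aggregation names and the helper function. `calendar_interval` is the parameter name in recent Elasticsearch versions; older versions use `interval` instead.

```python
def build_adjacency_query(terms, field="tags",
                          date_field="created_at", user_field="user_id"):
    """Build an adjacency_matrix request with one named filter per term."""
    return {
        "size": 0,
        "aggs": {
            "pairs": {
                "adjacency_matrix": {
                    # Filter names become bucket keys, so they should not
                    # contain the separator character ("&" by default).
                    "filters": {
                        term: {"term": {field: term}} for term in terms
                    }
                },
                "aggs": {
                    # 1. When the terms were first used together.
                    "first_used": {"min": {"field": date_field}},
                    # 2. How many times they were used together, per day.
                    "per_day": {
                        "date_histogram": {
                            "field": date_field,
                            "calendar_interval": "day",
                        }
                    },
                    # 3. How many different people used them together
                    #    (cardinality returns an approximate count).
                    "distinct_users": {"cardinality": {"field": user_field}},
                },
            }
        },
    }
```

The response contains one bucket per term and one per intersecting pair of terms (keyed along the lines of `earth&planet`), each carrying its own `doc_count` and the three child results.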

As an example - these are some of the significant terms from Wikipedia articles containing the text "Planet earth" and the adjacency matrix helps us see that these terms form discrete clusters representing the different potential meanings:

