Aggregating articles by mentions of games between two dates

I have the following query which returns data from a database of articles published about different video games.

GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "mediatype_id": 2
          }
        },
        {
          "range": {
            "inserted_at": {
              "gte": "2022-01-01",
              "lte": "2022-01-31"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

Here is an example response:

{
              "_index" : "articles",
              "_type" : "_doc",
              "_id" : "1507886006",
              "_score" : 2.0,
              "_source" : {
                "author" : "Shubhankar Parijat",
                "description" : "It's been rumoured that Microsoft could be working on a remastered collection for another one of its franchises, similar to Halo: The Master Chief Collection.",
                "feed_id" : 1919,
                "id" : 1507886006,
                "inserted_at" : "2022-01-25T08:14:27.756538Z",
                "language" : "en",
                "mediatype_id" : 2,
                "pub_date" : "2022-01-25T08:10:45Z",
                "score" : 4.0,
                "source" : 0,
                "title" : "Another Xbox Franchise is “Going to Get the Master Chief Collection Treatment” – Rumour",
                "url" : "https://gamingbolt.com/another-xbox-franchise-is-going-to-get-the-master-chief-collection-treatment-rumour"
              }
            },

If I wanted to find the top mentioned games in a large set of these articles, how could I aggregate that data, and group it in order of games with the most mentions in the set of articles between two dates to the least?

For example, if “Final Fantasy XVI” had 10k articles and “NieR: Automata” had 5k, Final Fantasy has to be first on the list of results.

As you can see there is no 'game' field in the data returned. Maybe the games mentioned in these articles will have to be aggregated somehow from the url/title/description? I am not sure how to do this. For the example above, this article mentions 'Master Chief Collection"

Thank you.

@Matthew_Hammond welcome to the community !
To group the documents with the most mentioned games at the beginning, you can use terms aggregation which, by default, sorts keys (or fields on which aggregation was performed) in descending order (i.e. the number of documents containing that value or key).

For more info, please refer: Terms aggregation | Elasticsearch Guide [8.6] | Elastic

Hi Ayush,

Thank you for your answer,

I can successfully aggregate articles by score with my new query below, using documentation you linked in your post.
However, what if I wanted aggregate my articles by a text field, like 'title', to aggregate articles by game titles?

To do this I have tried setting field data = true and aggregating by title, however this causes a timeout.
Is there another way I should be approaching this?

Many thanks

GET articles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "mediatype_id": 2
          }
        },
        {
          "range": {
            "inserted_at": {
              "gte": "2022-01-01",
              "lte": "2022-04-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "games": {
      "terms": {
        "field": "score",
        "order": { "_count": "desc" }
      }
    }
  }
}

So if I understand correctly, you need to order the games first by score and then by their titles ?
If yes, you can put another child terms aggs to games which would work on title.keyword field.
Also, if you want, you can move the must -> term query to aggs as well since essentially you are only searching for one value, and "terms" aggs have an attribute called includes to only consider those docs where that value is exact match (however, should not be metric/ numeric value).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.