Limit search result based on a field


(Christian) #1

Hi there,

I have an index full of episodes with this mapping:

PUT test_episodes

{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "_doc": {
      "dynamic": false,
      "properties": {
        "id": {
          "type": "keyword"
        },
        "podcastID": {
          "type": "keyword"
        },
        "createdAt": {
          "type": "date",
          "format": "date_time_no_millis"
        }
      }
    }
  }
}

I want to obtain the latest episodes for a podcast (based on podcastID). I have a query that returns all latest episodes (sorted by createdAt field). How can I limit this to ONE episode per podcastID?

Current query:
GET test_episodes/_search

{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "createdAt": {
                  "gte": "now-14d/d"
                }
              }
            }
          ]
        }
      }
    }
  },
  "sort": {
    "createdAt": {
      "order": "desc"
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "test_episodes",
        "_type": "_doc",
        "_id": "episode-0005",
        "_score": null,
        "_source": {
          "id": "episode-0005",
          "podcastID": "podcast-0002",
          "createdAt": "2018-10-29T17:00:00+00:00"
        },
        "sort": [
          1540832400000
        ]
      },
      {
        "_index": "test_episodes",
        "_type": "_doc",
        "_id": "episode-0002",
        "_score": null,
        "_source": {
          "id": "episode-0002",
          "podcastID": "podcast-0001",
          "createdAt": "2018-10-29T16:00:00+00:00"
        },
        "sort": [
          1540828800000
        ]
      },
      {
        "_index": "test_episodes",
        "_type": "_doc",
        "_id": "episode-0004",
        "_score": null,
        "_source": {
          "id": "episode-0004",
          "podcastID": "podcast-0002",
          "createdAt": "2018-10-29T16:00:00+00:00"
        },
        "sort": [
          1540828800000
        ]
      },
      {
        "_index": "test_episodes",
        "_type": "_doc",
        "_id": "episode-0001",
        "_score": null,
        "_source": {
          "id": "episode-0001",
          "podcastID": "podcast-0001",
          "createdAt": "2018-10-29T15:00:00+00:00"
        },
        "sort": [
          1540825200000
        ]
      },
      {
        "_index": "test_episodes",
        "_type": "_doc",
        "_id": "episode-0003",
        "_score": null,
        "_source": {
          "id": "episode-0003",
          "podcastID": "podcast-0002",
          "createdAt": "2018-10-29T14:00:00+00:00"
        },
        "sort": [
          1540821600000
        ]
      }
    ]
  }
}

Desired outcome result:
Only episode-0005 and episode-0002 should be returned.

Thanks!
Christian


(Gordon Brown) #2

Hi Christian!

You'll need to use a Terms aggregation to bucket by podcastID, then a Top Hits subaggregation, sorted by createdAt and size: 1. That will give you the newest episode for each podcast.


(Christian) #3

Hi Gordon,

Thanks for your help!

I am able to create the buckets as advised; but how can I order them (the buckets) by their episode's date?

GET test_episodes/_search

{
  "size": 0,
  "aggs": {
    "by_podcast": {
      "terms" : {
        "field" : "podcastID"
      },
      "aggs": {
        "latest_episodes": {
          "top_hits": {
            "sort" : [
              { "createdAt" :
                {
                  "order" : "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "*"
                ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

This will return the latest episode for each podcastID. How can I order the buckets now by the createdAt date of their episode?

Thanks again :slight_smile:
Christian


(Gordon Brown) #4

Ah! I didn't realize that was part of it. Since you can't access data from Top Hits in a pipeline aggregation, you'll have to also use a Max aggregation as a subaggregation of the Terms aggregation to bubble up the most recent date.

Then, you can use a Bucket Sort pipeline aggregation to sort the buckets by the newest creation date found in the Max aggregation above.


(Christian) #5

Thanks Gordon!

It seems to work for me as suggested:

GET test_episodes/_search

{
  "size": 0,
  "aggs": {
    "by_podcast": {
      "terms": {
        "field": "podcastID"
      },
      "aggs": {
        "latest_episodes": {
          "top_hits": {
            "sort": [
              {
                "createdAt": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "*"
              ]
            },
            "size": 1
          }
        },
        "recentDate": {
          "max": {
            "field": "createdAt"
          }
        },
        "podcasts_bucket_sort": {
          "bucket_sort": {
            "sort": [
              {
                "recentDate": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Have a nice week,
Christian


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.