Is it possible to return only the most recent of 'each document'?

Hi Elasticsearch community, total novice here with my first post.

Is it possible to design a query that will return only the most recent of each document in an index?

I'd like to store snapshots of our projects as they change, and then be able to retrieve the nearest-oldest snapshot for each project at any given time.

For example if I stored a snapshot of project 1 and 2 on 2019-03-10, then another snapshot of project 1 on 2019-03-14, my index would have:

{
project_id: 1,
hours_remaining: 10,
timestamp: 2019-03-10
}

{
project_id: 2,
hours_remaining: 20,
timestamp: 2019-03-10
}

{
project_id: 1;
hours_remaining: 5;
timestamp: 2019-03-14
}

Then if I queried with a timestamp filter of 2019-03-15 I'd get the last two records, but if I used a timestamp filter of 2019-03-13 I'd get the first two.

Is this possible? I've read about and discounted versioning, which this is sort of similar too. I've read about time series and event series as well, but I'm not sure if this counts as that kind of data

Any advice greatly appreciated. I'm familiar with SQL, but totally new to the Elastic world.

Regards
Daniel

1 Like

Hi Daniel, thanks for your interest in Elasticsearch!

If you just want one of the projects at a time, this is very straightforward - you just make a query sorted by the date:

GET testindex/_search
{
  "query": {
    "match": {
      "project_id": "1"
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "size": 1
}

If you want to retrieve the most recent document for all projects in one query, you can use Field Collapsing like so:

GET testindex/_search
{
  "size": 10, 
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "project_id",
    "inner_hits": {
      "name": "most_recent",
      "size": 1,
      "sort": [{"timestamp": "desc"}]
    }
  }
}

You'll have to adjust the size parameter based on the number of projects you have, of course, and the response is a bit verbose:

{
  //some fields elided for readability
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "testindex",
        "_type" : "_doc",
        "_id" : "MP0idGkBCYiTQNOubpiI",
        "_score" : 1.0,
        "_source" : {
          "project_id" : "1",
          "hours_remaining" : 10,
          "timestamp" : "2019-03-10"
        },
        "fields" : {
          "project_id" : [
            "1"
          ]
        },
        "inner_hits" : {
          "most_recent" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "testindex",
                  "_type" : "_doc",
                  "_id" : "Mv0idGkBCYiTQNOubpiI",
                  "_score" : null,
                  "_source" : {
                    "project_id" : "1",
                    "hours_remaining" : 5,
                    "timestamp" : "2019-03-14"
                  },
                  "sort" : [
                    1552521600000
                  ]
                }
              ]
            }
          }
        }
      },
      {
        "_index" : "testindex",
        "_type" : "_doc",
        "_id" : "Mf0idGkBCYiTQNOubpiI",
        "_score" : 1.0,
        "_source" : {
          "project_id" : "2",
          "hours_remaining" : 20,
          "timestamp" : "2019-03-10"
        },
        "fields" : {
          "project_id" : [
            "2"
          ]
        },
        "inner_hits" : {
          "most_recent" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "testindex",
                  "_type" : "_doc",
                  "_id" : "Mf0idGkBCYiTQNOubpiI",
                  "_score" : null,
                  "_source" : {
                    "project_id" : "2",
                    "hours_remaining" : 20,
                    "timestamp" : "2019-03-10"
                  },
                  "sort" : [
                    1552176000000
                  ]
                }
              ]
            }
          }
        }
      }
    ]
  }
}

The thing to pay attention to is the inner_hits field of each hit - the _source directly inside each hit is just the first document encountered for each project_id. So for example, for the first result, you'd look at the value of .hits.hits[0].inner_hits.most_recent.hits.hits[0]._source, for the second, .hits.hits[1].inner_hits.most_recent.hits.hits[0]._source, and so on.

Does that help get you what you need?

It's also worth noting that Elasticsearch does support a limited subset of SQL, although I don't think that it would be helpful in this case as it doesn't support the necessary GROUP BY functions (yet), but it may be helpful as you explore Elasticsearch.

4 Likes

Hi Gordon,
Thank you for the response, appreciate your effort.

It is the second scenario that we're interested in -retrieving the most recent document for all projects.

I've not come across collapsing before, I'll spend some time getting my head around it to see if it will do the trick.

At least I know now that we're not missing an obvious solution!

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.