Looking to get only the latest records from the index

Hi Community,

I'm currently working with Elasticsearch and looking for a way to retrieve only the latest records from an index. Let me break down the setup:

  1. We're using AWS Data Stream to load the logs, which are then sent to Firehose and on to Elasticsearch.
  2. We receive lakhs of records within a couple of minutes.
  3. There is a gap of about 30 seconds between each set of records that Elasticsearch receives from Firehose.
  4. Using Dev Tools, I run an aggregation on the data that arrives from Firehose. The curl command is then shared with the developer, and the formatted data is stored in S3 for later use.

The problem is that when I run the query from Dev Tools, I initially get a few hundred or a few thousand records. The next time (after 30 seconds) the data has accumulated, and the total hits keep increasing. So each time I run the script in Dev Tools, I get the old data plus the new data, which I don't want.
I'm looking for a method where, when I run the query, I get only the new records and not the old ones, so I can run the aggregation on the latest records. Is there any way to do this?

Thanks in advance.

Welcome to the community @Mohamed_Naufal

You can add a date field and use it in your query to look for a specific range.

Elasticsearch doesn't track state like that for you.
You'd want to look at using a date range query, where you keep track of the last time you ran it.
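A minimal Python sketch of that idea, assuming the date field is named `timestamp` (as in the example later in this thread) and that the caller stores the checkpoint between runs. The helper names `build_window_query` and `next_window` are hypothetical, not part of any Elasticsearch client. Each run queries the half-open window from the previous checkpoint up to the current time, then saves the current time as the new checkpoint, so consecutive windows neither overlap nor leave a gap.

```python
from datetime import datetime, timedelta, timezone


def build_window_query(last_run, now, field="timestamp"):
    """Build a range query body covering the half-open window [last_run, now)."""
    return {
        "query": {
            "range": {
                field: {
                    "gte": last_run.isoformat(),  # inclusive lower bound
                    "lt": now.isoformat(),        # exclusive upper bound
                }
            }
        }
    }


def next_window(last_run, now):
    """Return the query for this run and the checkpoint to store for the next run."""
    return build_window_query(last_run, now), now


# Example: two consecutive runs, 30 seconds apart.
start = datetime(2022, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
first_run = start + timedelta(seconds=30)
second_run = first_run + timedelta(seconds=30)

query_1, checkpoint = next_window(start, first_run)
query_2, checkpoint = next_window(checkpoint, second_run)
```

The returned dictionaries can be sent as the body of a `_search` request. One caveat under these assumptions: if `timestamp` holds the event time and documents reach the index late, a window that has already been queried will not pick them up, so it may be worth keeping the upper bound slightly behind the current time.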

Hi Warkolm,

Thanks for your response. But would I be able to change the date range dynamically? And how can I track data availability to make sure there is no data loss? It would be very helpful if you could share a sample template.

Hi Dinesh,

Thanks for your response. If I use the date field, would I be able to change it dynamically every 30 seconds? And how can I guarantee there is no data loss in the range I choose? Please let me know if you have any solutions for this.

You can use the now parameter, which represents the current time. That way you always look at the latest data, say the last day, or you can change it to hours or minutes as needed.

For example:

{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  }
}

Please refer to the Range query page in the Elasticsearch Guide [8.2] for more details.
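Note that the `/d` rounding in the example above snaps both bounds to the start of a day, so it matches the whole previous calendar day rather than the trailing 24 hours. For a trailing window such as the last 30 seconds, the rounding can be dropped. The following Python sketch builds such a query body; the function name `last_n_query` is hypothetical, and the field name `timestamp` is taken from the example above.

```python
def last_n_query(amount, unit, field="timestamp"):
    """Build a range query for the trailing window [now-<amount><unit>, now).

    unit is an Elasticsearch date math unit: "d" (days), "h" (hours),
    "m" (minutes) or "s" (seconds).
    """
    if unit not in {"d", "h", "m", "s"}:
        raise ValueError(f"unsupported date math unit: {unit!r}")
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {
        "query": {
            "range": {
                field: {
                    "gte": f"now-{amount}{unit}",  # e.g. "now-30s"
                    "lt": "now",
                }
            }
        }
    }


# Example: documents from the last 30 seconds.
recent = last_n_query(30, "s")
```

Because `now` is evaluated when the query runs, two runs that are not exactly 30 seconds apart will overlap or leave a gap between their windows. If that matters, explicit timestamps tracked between runs are the safer choice.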

Hey Dinesh,

This seems like a good idea, but I feel there would be a minimal amount of data loss! What are your thoughts on that?

@Mohamed_Naufal

Can you help us understand what you mean by "minimal amount of data loss!"?

Are you concerned about data being lost or dropped during the ingestion process?

Data is not "lost" during the query process...
