I'm currently working with Elasticsearch and looking for a way to retrieve only the latest records from an index. Let me break down the setup:
We use an AWS Kinesis Data Stream to ingest the logs, which are then sent to Firehose and delivered to Elasticsearch.
We receive hundreds of thousands of records within a span of a few minutes.
There is a roughly 30-second gap between each batch of records that Firehose delivers to Elasticsearch.
So, using Dev Tools, I run an aggregation on the data delivered by Firehose. The curl request is then shared with the developer, and the formatted data is stored in S3 for later use.
The problem is: when I run the query from Dev Tools, I initially get a few hundred or thousand records. The next time (after 30 seconds) more data has accumulated, so the total hits keep increasing. Each time I run the script in Dev Tools, I get the old data plus the new data, which I don't want.
I'm looking for a method where, when I run the query, I get only the new records and not the old ones, so I can aggregate on the latest records alone. Is there any way to do this?
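One common approach (assuming each document carries an ingest timestamp; the field name `@timestamp` below is an assumption, Firehose/your producer would need to set it) is to add a `range` filter to every query so only documents newer than the previous run's checkpoint match. A minimal Python sketch that just builds the query body:

```python
from datetime import datetime, timedelta, timezone

def build_new_records_query(last_run: datetime) -> dict:
    """Build an Elasticsearch query body matching only documents
    ingested strictly after `last_run`.
    The `@timestamp` field name is an assumption about the mapping."""
    return {
        "query": {
            "range": {
                "@timestamp": {
                    "gt": last_run.isoformat()  # strictly newer than the checkpoint
                }
            }
        }
    }

# Example: only fetch documents from the last 30 seconds
checkpoint = datetime.now(timezone.utc) - timedelta(seconds=30)
body = build_new_records_query(checkpoint)
```

The same body can be pasted into Dev Tools (or sent via curl) with your aggregation added alongside the `query` clause, so the aggregation only sees the new slice.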
Thanks for your response, but would I be able to change the date range dynamically? And how can I track data availability to make sure there is no data loss? It would be very helpful if you could share a sample template.
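On making the range dynamic: one sketch is to keep a stored checkpoint, query the half-open window (checkpoint, now], and advance the checkpoint to `now` only after the run succeeds. The `@timestamp` field and the checkpoint values here are assumptions for illustration:

```python
def next_window(last_checkpoint: str, now: str) -> dict:
    """Range clause covering (last_checkpoint, now].
    Advancing the checkpoint to `now` after each successful run
    makes the range move forward every 30 seconds."""
    return {"range": {"@timestamp": {"gt": last_checkpoint, "lte": now}}}

# Each 30-second run: query (prev, now], then set prev = now.
prev = "2024-01-01T00:00:00Z"   # hypothetical stored checkpoint
now = "2024-01-01T00:00:30Z"
clause = next_window(prev, now)
```

Because each window starts exactly where the previous one ended (exclusive `gt`, inclusive `lte`), consecutive runs neither overlap nor leave a gap between them.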
Thanks for your response. If I use a date field, would I be able to change the range dynamically every 30 seconds? And what guarantee is there that no data is lost within the range I choose? Please help if you have any solution for this.
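On the data-loss concern: wall-clock windows can miss documents that are delivered late by Firehose buffering. One hedged mitigation is to checkpoint on the newest timestamp actually observed in the results rather than on the clock, so the next run resumes exactly from the last record seen. The hit structure and field names below are assumptions:

```python
def advance_checkpoint(hits: list, current: str) -> str:
    """Move the checkpoint to the newest @timestamp actually seen in
    `hits`, falling back to the current checkpoint if nothing arrived.
    ISO-8601 timestamps compare correctly as strings."""
    timestamps = [h["_source"]["@timestamp"] for h in hits]
    return max(timestamps + [current])

# Hypothetical hits from the previous query run
hits = [
    {"_id": "a", "_source": {"@timestamp": "2024-01-01T00:00:10Z"}},
    {"_id": "b", "_source": {"@timestamp": "2024-01-01T00:00:25Z"}},
]
new_cp = advance_checkpoint(hits, "2024-01-01T00:00:00Z")
```

Since the checkpoint only ever moves to a timestamp you have actually retrieved, nothing between two runs can be silently skipped; if a run returns no hits, the checkpoint simply stays put.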