Dear community,
I want to create a script that can:
- Fetch data from a given source index
- Process the data and store the results in a destination index
The script will be scheduled to run once every hour.
The data contains a timestamp field. My approach is to keep a "fetched_timestamp" variable that marks the point after which documents still need to be fetched. To implement this, I'm using the scroll API with the following query:
    "query": {
        "range": {
            "@timestamp": {
                "gte": timestamp
            }
        }
    }
This creates a scroll context, and with the help of the scroll_id I can fetch all the records from the current snapshot of the source index. Once the current snapshot has been fully scrolled, I update the "fetched_timestamp" variable to the maximum timestamp among the fetched documents and then fetch the next batch of documents.
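For reference, the loop described above could be sketched roughly as follows in Python. This is only a minimal sketch under assumptions: `client` is any object exposing `search`, `scroll`, and `clear_scroll` with keyword arguments modeled on the elasticsearch-py client (the exact parameters, such as `body`, depend on the client version), the helper names `build_range_query`, `scroll_documents`, and `run_once` are made up for illustration, and timestamps are assumed to be ISO-8601 strings in one consistent format so that string comparison matches chronological order.

```python
from typing import Any, Dict, Iterator


def build_range_query(fetched_timestamp: str) -> Dict[str, Any]:
    """Search body matching documents at or after fetched_timestamp."""
    return {"query": {"range": {"@timestamp": {"gte": fetched_timestamp}}}}


def scroll_documents(client: Any, index: str, fetched_timestamp: str,
                     page_size: int = 1000,
                     keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit from one scroll snapshot of the source index."""
    response = client.search(
        index=index,
        scroll=keep_alive,
        size=page_size,
        body=build_range_query(fetched_timestamp),
    )
    scroll_id = response.get("_scroll_id")
    try:
        # An empty page means the snapshot has been fully scrolled.
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context instead of waiting for it to expire.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


def run_once(client: Any, index: str, fetched_timestamp: str) -> str:
    """Process one snapshot and return the new fetched_timestamp."""
    newest = fetched_timestamp
    for hit in scroll_documents(client, index, fetched_timestamp):
        # Processing and writing to the destination index would go here.
        newest = max(newest, hit["_source"]["@timestamp"])
    return newest
```

Because the query uses "gte", documents whose timestamp equals the stored "fetched_timestamp" are returned again on the next run, so the processing step would need to tolerate duplicates.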
The problem is that new documents continue to be added to the source index all day long. A back-filled document with an older timestamp may be indexed after a document with a newer timestamp. Since I always fetch documents at or after a certain timestamp, I can miss documents that were inserted recently but carry older timestamps.
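To make the gap concrete, here is a small in-memory simulation of two hourly runs. It does not talk to Elasticsearch at all: the "index" is just a Python list of timestamp strings, and `fetch_since` and `simulate_backfill_gap` are hypothetical names made up for this illustration, as are the sample timestamps.

```python
from typing import List


def fetch_since(index: List[str], fetched_timestamp: str) -> List[str]:
    """Stand-in for the range query: timestamps >= fetched_timestamp."""
    return [ts for ts in index if ts >= fetched_timestamp]


def simulate_backfill_gap() -> List[str]:
    """Return the timestamps that the two runs never fetched."""
    index = ["2024-01-01T10:00:00Z", "2024-01-01T10:30:00Z"]
    fetched_timestamp = "2024-01-01T00:00:00Z"
    seen = set()

    # Run 1: fetch everything so far and advance the stored timestamp.
    batch = fetch_since(index, fetched_timestamp)
    seen.update(batch)
    fetched_timestamp = max(batch)

    # Between runs: a back-filled (older) document and a new one arrive.
    index.append("2024-01-01T09:45:00Z")
    index.append("2024-01-01T11:00:00Z")

    # Run 2: only documents at or after the stored timestamp are fetched.
    batch = fetch_since(index, fetched_timestamp)
    seen.update(batch)

    return sorted(set(index) - seen)


if __name__ == "__main__":
    print(simulate_backfill_gap())
```

In this made-up scenario the 09:45 document is never fetched, because it was indexed after the stored timestamp had already moved past it.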
Is there any way to get all the documents, including the back-filled ones? Any suggestions would be very much appreciated.
Thanks in advance.