Fetch and store data from an index every hour

Dear community,

I want to create a script that can -

  • Fetch data from a given source index
  • Process the data and store into a destination index

The script will be scheduled to run once every hour.

The data contains a timestamp field. My approach is to keep a "fetched_timestamp" variable which records the point after which documents still need to be fetched. To implement this, I'm using the scroll API with the following query -

"query": {
    "range": {
        "@timestamp": {
            "gte": timestamp
        }
    }
}

This creates a scroll context, and with the scroll_id I can fetch all the records from the current snapshot of the source index. Once the snapshot has been fully scrolled, I will update the "fetched_timestamp" variable to the maximum timestamp among the fetched documents and use it to get the next batch.
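A minimal sketch of that cursor logic, kept cluster-free: the two helper functions below are hypothetical names, and only the query shape mirrors the actual scroll request.

```python
# Sketch of the hourly cursor logic described above. `build_query` and
# `advance_cursor` are hypothetical helpers; only the query body matches
# the real Elasticsearch range query.

def build_query(fetched_timestamp):
    """Search body selecting documents at or after the stored cursor."""
    return {
        "query": {
            "range": {
                "@timestamp": {"gte": fetched_timestamp}
            }
        }
    }

def advance_cursor(docs, fetched_timestamp):
    """New cursor: the max @timestamp seen in the batch (or the old cursor
    if the batch was empty)."""
    timestamps = [d["_source"]["@timestamp"] for d in docs]
    return max(timestamps + [fetched_timestamp])
```

In a real run, the body from build_query(cursor) would go into the initial search call opened with a scroll timeout, and subsequent pages would be pulled with the returned scroll_id until the snapshot is exhausted.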

The problem is that new documents are added to the source index all day long, and a document with an older timestamp (a back-fill) may be inserted after documents with newer timestamps. Since I always fetch documents after a certain timestamp, I can lose documents that were inserted recently but carry older timestamps.
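A toy simulation of the failure mode (no Elasticsearch involved, timestamps shortened for readability): a document inserted later but stamped with an earlier @timestamp falls behind the cursor and is never fetched.

```python
# Hour 1: fetch everything at or after the cursor, then advance the cursor.
cursor = "10:00"
hour_1 = [{"@timestamp": "10:05"}, {"@timestamp": "10:55"}]
fetched_1 = [d for d in hour_1 if d["@timestamp"] >= cursor]
cursor = max(d["@timestamp"] for d in fetched_1)  # cursor is now "10:55"

# Hour 2: a back-filled document arrives with an *older* timestamp.
hour_2 = [{"@timestamp": "10:30"}, {"@timestamp": "11:10"}]
fetched_2 = [d for d in hour_2 if d["@timestamp"] >= cursor]
# fetched_2 contains only the "11:10" document; the "10:30" one is lost.
```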

Is there any way to get all the documents? Any suggestion is very much appreciated.

Thanks in advance.

So apparently the log-creation timestamp is not the right one to track. The timestamp of ingestion into Elasticsearch could be more helpful. See the _ingest timestamp at Ingest pipelines | Elasticsearch Guide [7.13] | Elastic
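As a hedged sketch of what that could look like: an ingest pipeline with a set processor that stamps each document with its ingestion time. The pipeline and field names here are placeholders; the set processor and {{_ingest.timestamp}} are the mechanism described in the linked guide.

```python
# Ingest pipeline body (placeholder names) that records the ingestion time
# on every document via the `set` processor and {{_ingest.timestamp}}.
ingest_pipeline = {
    "description": "Record the ingestion time on every document",
    "processors": [
        {
            "set": {
                "field": "ingested_at",   # hypothetical field name
                "value": "{{_ingest.timestamp}}"
            }
        }
    ]
}
```

This body would be registered with PUT _ingest/pipeline/&lt;name&gt; and applied either per request (?pipeline=...) or as the index's default pipeline; the script's cursor can then track the ingestion-time field instead of the event timestamp.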

Maybe something like transforms can help you out, so you don't have to come up with your own logic. Can you explain your use-case in a bit more detail?
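For reference, a rough sketch of a continuous transform body (registered with PUT _transform/&lt;id&gt;); the index names, grouping, and aggregation are placeholders for whatever the real use-case needs.

```python
# Placeholder continuous-transform definition: reads from a source index,
# writes hourly rollups to a destination index, and keeps itself in sync
# on a time field with a small delay for late-arriving documents.
transform_body = {
    "source": {"index": "source-index"},
    "dest": {"index": "dest-index"},
    "sync": {
        "time": {
            "field": "@timestamp",
            "delay": "60s"  # tolerate documents that arrive slightly late
        }
    },
    "pivot": {
        "group_by": {
            "hour": {"date_histogram": {"field": "@timestamp",
                                        "calendar_interval": "1h"}}
        },
        "aggregations": {
            "doc_count": {"value_count": {"field": "@timestamp"}}
        }
    }
}
```

Note that the sync delay only covers modestly late documents; large back-fills would still call for syncing on an ingestion-time field rather than the event timestamp.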

I think you understood it correctly. There are two timestamp fields -

  1. request_timestamp: timestamp at which request was made.
  2. created_timestamp: timestamp at which document was ingested.

Earlier I was using request_timestamp, which is what caused the issue. I think using created_timestamp is the better approach.

Thanks for the suggestion.