Only Process Newly Indexed Documents

Is there a way to get the date and time that an Elasticsearch document was written? I want to run over the data hourly without consuming data I have already processed in prior runs. The drawbacks of the options I have considered are listed below, but I am trying to exhaust all the possibilities before settling on an implementation.

I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now.

What is the most efficient way to do this?

I have looked at:

  • Updating each document to add a field holding an array of booleans that records which analytics have already looked at it. The negative is waiting for the update to occur.
  • An index-per-time-frame approach, which would break the current indices down into smaller ones, for example one per hour. The negative I see is the number of open file descriptors.
  • ??

Elasticsearch version: 5.6

You could set a timestamp when the document reaches Elasticsearch using an ingest pipeline.
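
A minimal sketch of what such a pipeline could look like, assuming a hypothetical pipeline id `add-ingest-timestamp` and a hypothetical field name `ingested_at` (neither comes from this thread). It uses the `set` processor to copy the `_ingest.timestamp` metadata into a regular document field. The Python below only builds and prints the request body; it would be sent with `PUT _ingest/pipeline/add-ingest-timestamp`, and documents would then be indexed with `?pipeline=add-ingest-timestamp`.

```python
import json

# Hypothetical names chosen for illustration only.
PIPELINE_ID = "add-ingest-timestamp"
TIMESTAMP_FIELD = "ingested_at"


def build_timestamp_pipeline(field=TIMESTAMP_FIELD):
    """Build the body of an ingest pipeline that stamps each document.

    The `set` processor writes the ingest timestamp into `field`,
    so the value is stored with the document and can be queried later.
    """
    return {
        "description": "Store the ingest timestamp on each document",
        "processors": [
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


if __name__ == "__main__":
    # Body for: PUT _ingest/pipeline/add-ingest-timestamp
    print(json.dumps(build_timestamp_pipeline(), indent=2))
```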

Thank you. I am looking into this. Is there a delay between the pipeline running and when I can read the document? Also, for this to work I would need to walk through a time range, right? Something like "last run to now()" logic.

You can read the document directly by ID, but you will have to wait until a refresh takes place before it can be searched. Your time range logic would therefore need to be adjusted for your refresh interval and how long a refresh usually takes.
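
One way to sketch that adjustment: each run reads a half-open window whose upper bound trails the current time by a safety margin, and the next run starts exactly where the previous one ended. The function name `next_window` and the 30 second default margin are assumptions for illustration; the margin would need to be chosen to cover your actual refresh interval plus typical refresh duration.

```python
from datetime import datetime, timedelta, timezone


def next_window(last_end, now, safety_margin=timedelta(seconds=30)):
    """Return the (start, end) window for this run, or None.

    `last_end` is the upper bound used by the previous run.
    `end` trails `now` by `safety_margin` so that documents stamped
    before `end` have had time to become searchable (assumed margin).
    Returns None when no time has safely elapsed since the last run.
    """
    end = now - safety_margin
    if end <= last_end:
        return None
    return last_end, end


if __name__ == "__main__":
    previous_end = datetime(2018, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    current_time = datetime(2018, 1, 1, 13, 0, 0, tzinfo=timezone.utc)
    print(next_window(previous_end, current_time))
```

Persisting the returned `end` and passing it as `last_end` on the next run keeps the windows contiguous, so no document is skipped or read twice as long as the margin is large enough.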

"Ingest metadata is transient and is lost after a document has been processed by the pipeline", so would I even be able to see this field? How would I know which document IDs to process with the field gone?

The example I linked to shows how to copy this timestamp into the document on indexing. You then have it on the document and can retrieve data based on it.
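
As a sketch of retrieving data based on that field, the function below builds a `range` query body over a time window. The field name `ingested_at` and the function name are hypothetical. Bounds are sent as epoch milliseconds with an explicit `format` to sidestep date-string parsing differences. With elasticsearch-hadoop, a body like this could presumably be supplied through the `es.query` setting, which is worth verifying against the connector version in use.

```python
import json
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def build_range_query(start, end, field="ingested_at"):
    """Build a query matching documents with start <= field < end.

    `field` is a hypothetical name for the stored ingest timestamp.
    The half-open interval avoids reading boundary documents twice.
    """
    return {
        "query": {
            "range": {
                field: {
                    "gte": to_epoch_millis(start),
                    "lt": to_epoch_millis(end),
                    "format": "epoch_millis",
                }
            }
        }
    }


if __name__ == "__main__":
    window_start = datetime(2018, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    window_end = window_start + timedelta(hours=1)
    print(json.dumps(build_range_query(window_start, window_end), indent=2))
```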
