Is there a way to get the date and time that an Elasticsearch document was written? I'm looking to run over data hourly and avoid consuming data I have already processed in prior runs. The negatives of the options I have considered are listed below, but I want to exhaust all the possibilities before committing to an implementation.
I am running Elasticsearch queries via Spark and would prefer NOT to look through all the documents I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now (see the sketch after the options below).
What is the most efficient way to do this?
I have looked at:
Updating each document to add an array of booleans recording whether it has been looked at by each analytic. The negative is waiting for the update to occur.
An index-per-time-frame approach, which would break the current indexes down into smaller ones, e.g. one per hour. The negative I see is the number of open file descriptors.
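For context, here is a minimal sketch of the kind of incremental read I have in mind, assuming the elasticsearch-hadoop Spark connector and a hypothetical `ingested_at` timestamp field on each document; the host, index name, and time bounds are placeholders:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-es-read").getOrCreate()

# Push the time-range filter down to Elasticsearch so only documents
# ingested since the last successful run are shipped to Spark at all.
range_query = json.dumps({
    "query": {
        "range": {
            "ingested_at": {                    # hypothetical timestamp field
                "gte": "2024-01-01T10:00:00Z",  # last successful run
                "lt": "2024-01-01T11:00:00Z",   # this run's cutoff
            }
        }
    }
})

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.query", range_query)
    .load("my-index")
)
```

The missing piece is a reliable timestamp field to range over, which is what the options above try to work around.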
Thank you, I am looking into this. Is there a delay from the pipeline until I can read the document? Also, for this to work I would need to walk through a time range, right? Last-run-to-now() type of logic.
You can read the document directly by ID, but it will not be searchable until a refresh has taken place. This means your time-range logic would need to be adjusted based on your refresh interval and how long a refresh usually takes.
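Roughly, something like this sketch, assuming an `ingested_at` field and the elasticsearch-py client (7.x-style `body=` argument); the 30-second margin, field name, and index name are placeholders to tune:

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Leave a margin so documents that are indexed but not yet refreshed
# (and therefore not yet searchable) are picked up by the next run
# instead of being skipped forever.
REFRESH_MARGIN = timedelta(seconds=30)  # tune to your refresh interval

def fetch_new_docs(last_run: datetime):
    upper_bound = datetime.now(timezone.utc) - REFRESH_MARGIN
    body = {
        "size": 1000,
        "query": {
            "range": {
                "ingested_at": {
                    "gte": last_run.isoformat(),
                    "lt": upper_bound.isoformat(),
                }
            }
        },
    }
    resp = es.search(index="my-index", body=body)
    # Persist upper_bound as the checkpoint for the next run.
    return resp["hits"]["hits"], upper_bound
```

Using an upper bound strictly below "now" keeps each window closed, so a document never falls between two runs just because it had not been refreshed yet.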
"Ingest metadata is transient and is lost after a document has been processed by the pipeline", so would I even be able to see this field? How would I know which document IDs to process with the field gone?
The example I linked to shows how to copy this timestamp into the document on indexing. You then have it on the document and can retrieve data based on it.
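For reference, a minimal version of that pattern in Python, assuming the elasticsearch-py client; the pipeline id `add-ingest-timestamp`, the field name `ingested_at`, and the index name are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pipeline that copies the transient ingest timestamp into a regular,
# searchable field before the document is stored.
es.ingest.put_pipeline(
    id="add-ingest-timestamp",
    body={
        "description": "Stamp each document with the time it was ingested",
        "processors": [
            {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}}
        ],
    },
)

# Index through the pipeline so every document carries ingested_at.
es.index(index="my-index", pipeline="add-ingest-timestamp", body={"message": "hello"})
```

If you don't want to pass `pipeline=` on every request, you can set it as the index's `index.default_pipeline` setting so it is applied automatically.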