Is there a way to get the date and time that an Elasticsearch document was written? I'm looking to run over data hourly and avoid consuming data I have already processed in prior runs. The negatives of the options I have considered are listed below, but I want to exhaust all the possibilities before committing to an implementation.
I am running Elasticsearch queries via Spark and would prefer NOT to look through all the documents I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now (see the sketch after the options below).
What is the most efficient way to do this?
I have looked at:
Updating each document to add an array of booleans recording whether it has been looked at by each analytic. The negative is waiting for the update to occur.
An index-per-time-frame approach, which would break the current indexes down into smaller ones, e.g. one per hour. The negative I see is the number of open file descriptors.
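For context, here is a minimal sketch of the kind of incremental read I have in mind, assuming the elasticsearch-hadoop Spark connector and a hypothetical `ingested_at` timestamp field on each document; the host, index name, and time bounds are placeholders:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-es-read").getOrCreate()

# Push the time-range filter down to Elasticsearch so only documents
# ingested since the last successful run are shipped to Spark at all.
range_query = json.dumps({
    "query": {
        "range": {
            "ingested_at": {                    # hypothetical timestamp field
                "gte": "2024-01-01T10:00:00Z",  # last successful run
                "lt": "2024-01-01T11:00:00Z",   # this run's cutoff
            }
        }
    }
})

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.query", range_query)
    .load("my-index")
)
```

The missing piece is a reliable timestamp field to range over, which is what the options above try to work around.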
Thank you, I am looking into this. Is there a delay from the pipeline until I can read the document? Also, for this to work I would need to walk through a time range, right? Last-run-to-now() type of logic.
You can read the document directly by ID, but it will not be searchable until a refresh has taken place. This means your time-range logic would need to be adjusted based on your refresh interval and how long a refresh usually takes.
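Roughly, something like this sketch, assuming an `ingested_at` field and the elasticsearch-py client (7.x-style `body=` argument); the 30-second margin, field name, and index name are placeholders to tune:

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Leave a margin so documents that are indexed but not yet refreshed
# (and therefore not yet searchable) are picked up by the next run
# instead of being skipped forever.
REFRESH_MARGIN = timedelta(seconds=30)  # tune to your refresh interval

def fetch_new_docs(last_run: datetime):
    upper_bound = datetime.now(timezone.utc) - REFRESH_MARGIN
    body = {
        "size": 1000,
        "query": {
            "range": {
                "ingested_at": {
                    "gte": last_run.isoformat(),
                    "lt": upper_bound.isoformat(),
                }
            }
        },
    }
    resp = es.search(index="my-index", body=body)
    # Persist upper_bound as the checkpoint for the next run.
    return resp["hits"]["hits"], upper_bound
```

Using an upper bound strictly below "now" keeps each window closed, so a document never falls between two runs just because it had not been refreshed yet.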
"Ingest metadata is transient and is lost after a document has been processed by the pipeline", so would I even be able to see this field? How would I know which document IDs to process with the field gone?
The example I linked to shows how to copy this timestamp into the document on indexing. You then have it on the document and can retrieve data based on it.
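For reference, a minimal version of that pattern in Python, assuming the elasticsearch-py client; the pipeline id `add-ingest-timestamp`, the field name `ingested_at`, and the index name are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pipeline that copies the transient ingest timestamp into a regular,
# searchable field before the document is stored.
es.ingest.put_pipeline(
    id="add-ingest-timestamp",
    body={
        "description": "Stamp each document with the time it was ingested",
        "processors": [
            {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}}
        ],
    },
)

# Index through the pipeline so every document carries ingested_at.
es.index(index="my-index", pipeline="add-ingest-timestamp", body={"message": "hello"})
```

If you don't want to pass `pipeline=` on every request, you can set it as the index's `index.default_pipeline` setting so it is applied automatically.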