Hi guys
We have the following requirements:
- Ingest a safety item list into ES daily. A batch job writes the source data to S3 under the path bucket/{date}/part*.json, and the files are then continuously ingested into ES via S3 events.
- When a query comes in, we should always search the most recent safety item list to decide whether the item is safe. This means the query should hit the most recent daily index, and only once its ingestion has completed.
I am racking my brain over how to do this in the most robust and automatic way. I have considered:
- Data stream
- Explicitly writing to a named index from the ingestion job
- Rollover API / ILM
- Using one index that holds all entries, with their timestamp as a field
But none of them seems to work one hundred percent. For example, what happens if a query arrives while the current day's ingestion is still in progress? The item might not yet be in the newest index. I also need to handle the case where the ingestion job fails for a particular day (or several days): we should not block the query but fall back to the most recent successfully ingested documents.
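To make the behaviour I am after concrete, here is a rough sketch of the selection logic in plain Python, with no Elasticsearch calls. The index naming scheme `safety-items-YYYY.MM.DD`, the function name `pick_index` and the per-day "completed" flag are only assumptions for illustration; where such a flag would actually be stored is part of what I am unsure about.

```python
from datetime import date
from typing import Dict, Optional


def pick_index(status: Dict[date, bool], today: date) -> Optional[str]:
    """Return the newest daily index whose ingestion has completed.

    `status` maps an ingestion date to True if that day's ingestion
    finished successfully. Days that failed or are still in progress
    are False (or simply missing).
    """
    # Keep only days that completed and are not in the future.
    done = [d for d, ok in status.items() if ok and d <= today]
    if not done:
        return None  # nothing has ever been ingested successfully
    # Hypothetical naming scheme: safety-items-YYYY.MM.DD
    return f"safety-items-{max(done):%Y.%m.%d}"


# Example: today's ingestion is still running and yesterday's failed,
# so the query should fall back to the index from two days ago.
status = {
    date(2024, 1, 1): True,
    date(2024, 1, 2): False,  # failed
    date(2024, 1, 3): False,  # still in progress
}
print(pick_index(status, date(2024, 1, 3)))
```

So the query never blocks: it just uses the latest day that is known to be complete.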
One idea is to let ILM roll over the index after an age of 30 minutes (the ingestion process takes around 6 minutes) and always query the last-but-one index. The question is then: how does the requester know which index is the last but one?
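If the requester could somehow get the list of backing index names, picking the last but one would be simple, since rollover appends an incrementing zero-padded number to the name. A minimal sketch, assuming the names are already available (the `safety-items` prefix and the hard-coded list are hypothetical; in practice the list would have to come from Elasticsearch):

```python
from typing import List, Optional


def last_but_one(index_names: List[str]) -> Optional[str]:
    """Return the second-newest index, ordered by rollover suffix."""
    # Rollover names end in a numeric suffix such as "-000042".
    ordered = sorted(index_names, key=lambda n: int(n.rsplit("-", 1)[1]))
    # With fewer than two indices there is no "last but one" yet.
    return ordered[-2] if len(ordered) >= 2 else None


names = ["safety-items-000003", "safety-items-000001", "safety-items-000002"]
print(last_but_one(names))
```

This still leaves the real problem open: each requester would need an extra lookup for the names before every query, which is what I would like to avoid.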
Or is there any other approach I can take to solve this problem?