Best way to manage daily index

Hi guys

We have the following requirements:

  1. Ingest a safety item list into ES daily. A batch job writes the source data to S3 under the path bucket/{date}/part*.json, and the files are then continuously ingested into ES on each S3 event.
  2. When a query comes in, we should always search the most recent safety item list to decide whether the item is safe. This means the query should hit the most recent daily index, and only once its ingestion has completed.

I am racking my brain over how to do this in the most robust and automated way. I have considered:

  1. Data stream
  2. Explicitly write index with ingestion job
  3. Rollover API / ILM
  4. Use one index and hold all entries with their timestamp as field

But none of them seems to work one hundred percent. For example, what happens if a query arrives while ingestion of the current day's data is still in progress (then I won't be able to find the item in the newest index)? I also need to consider cases where the ingestion job fails for a particular day (or several days): we should not block the query but fall back to the most recent successfully ingested documents.

I am thinking one way is to let ILM roll over an index after an age of 30 minutes (the ingestion process takes around 6 minutes) and always query against the last-but-one index. The question is then: how does the requester know which index is the last but one?

Or is there any other approach I can take to solve the problem?

As far as I understand, your requirements do not seem a natural fit for time-based indices or ILM. It would seem more natural to create a custom process to handle this. Create an alias that always points to the index you want to query. Then create a script that uploads the data to a completely new index, which might have a timestamp in its name based on when the upload started. Once the upload has completed, change the read alias to point to the new index and remove the old one. Queries keep hitting the previous complete index while an upload is in progress, and if an upload fails the alias simply stays where it was.
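
To make the alias swap concrete, here is a rough Python sketch of the two pieces a script like that would need: a timestamped index name, and the request body for the `_aliases` API, which applies all of its actions atomically. The names `safety-items` and `safety-items-current` and both helper functions are made up for illustration. The bulk upload and the HTTP call itself are left out and would depend on your client and setup.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def new_index_name(prefix: str, started_at: datetime) -> str:
    """Name a fresh index after the time the upload started (hypothetical naming scheme)."""
    return f"{prefix}-{started_at.strftime('%Y%m%d-%H%M%S')}"


def build_alias_swap(alias: str, new_index: str, old_index: Optional[str]) -> dict:
    """Build the body for POST /_aliases.

    The alias is added to the new index and, if there is a previous index,
    that index is deleted in the same atomic request, so readers never see
    a moment where the alias is missing or points at two indices.
    """
    actions = [{"add": {"index": new_index, "alias": alias}}]
    if old_index is not None:
        actions.append({"remove_index": {"index": old_index}})
    return {"actions": actions}


if __name__ == "__main__":
    # Assumed names, for illustration only.
    alias = "safety-items-current"
    old_index = "safety-items-20240101-020000"
    new_index = new_index_name("safety-items", datetime.now(timezone.utc))

    # 1. Upload the whole day's data into new_index (not shown).
    # 2. Only after the upload has fully succeeded, send this body to POST /_aliases.
    print(json.dumps(build_alias_swap(alias, new_index, old_index), indent=2))
```

Queries always go to the alias, never to a concrete index name. If step 1 fails, step 2 is never sent, so the alias keeps pointing at the last successfully loaded index and the half-filled one can be deleted or retried later.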

