How Elastic handles frequently changing data

Hello Team,

I have a use case where I make an API call to a third-party system, then download the dataset to a mounted volume in the Kubernetes pod where our service is running. This data will then be indexed into Elastic for search and AI use cases. I wanted to discuss the challenge of using an emptyDir volume in Kubernetes for state persistence across pod restarts. We are not using a PVC. We know that an emptyDir volume is ephemeral and is deleted when the pod is removed, which means any state information (like the last indexed file) would be lost.

Below are some of the approaches for state persistence:

  • Use an external storage solution to persist the state (like MongoDB, PostgreSQL, Redis).
  • Use Kubernetes ConfigMaps or Secrets. This is less common for dynamic data due to size and performance considerations.
  • Hybrid approach: use emptyDir for transient data but persist the last indexed state elsewhere (like a database or external storage). This way, we get fast access to temporary files while still maintaining state across pod restarts.
  • Use Elastic itself to store the state.

I want to use Elastic and would like to hear your views. Using Elastic to store the state information for the indexing process would keep everything within the same data ecosystem. The idea is to create a state document that represents the state of the indexing process, with fields like indexName, lastIndexedFile, timestamp, etc.
I am a little skeptical about the high-frequency update scenario for ES. Keeping dependencies to a minimum is good, so Elastic is the better option for state storage for that reason, but how will Elastic handle frequently changing data? Do you have any recommendation on choosing the better approach? Is there any other approach that could be followed here?
Note: we can make up to a million API calls to the third-party system in a 10-20 minute window. This is not consistent; it is the maximum, and there can be times with 0 calls for a whole hour.
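
For illustration, the state document described above could be sketched as follows. This is a minimal sketch, not a definitive schema: the field names indexName, lastIndexedFile and timestamp come from the post, while the class name, the helper functions and the ISO 8601 UTC timestamp format are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IndexingState:
    """Hypothetical checkpoint document; field names follow the post."""
    indexName: str
    lastIndexedFile: str
    timestamp: str  # assumed format: ISO 8601 in UTC


def make_state(index_name: str, last_file: str,
               now: Optional[datetime] = None) -> IndexingState:
    """Build a state document for the given target index."""
    now = now or datetime.now(timezone.utc)
    return IndexingState(index_name, last_file, now.isoformat())


def to_json(state: IndexingState) -> str:
    """Serialize the state to the JSON body that would be sent to the state index."""
    return json.dumps(asdict(state), sort_keys=True)
```

A document built this way is small, so the cost of storing it is driven by how often it is written rather than by its size.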

Elasticsearch is IMHO neither designed nor optimised for use cases with frequent updates, so I would test it and see how it works before committing. It is likely to add a lot of overhead, so if you are using the Elasticsearch cluster for other things, those may very well be affected.

Thanks @Christian_Dahlqvist for the reply. The data in the actual index is essentially immutable. The updates I am referring to concern a separate index and document for state management, to handle the pod restart issue and to give our service a way to resume indexing JSON files from the last successfully indexed document, without using a PVC and with a normal mounted volume. In the happy path without pod restarts, we don't need to go back and forth to ES to get or save the state information. However, if we are in the middle of the ingestion process and the pod restarts, we don't want to do it all over again; we want to resume from the last processed doc.
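
The resume behaviour described here could look roughly like the sketch below. It is only an illustration under stated assumptions: InMemoryStateStore is a hypothetical stand-in for the state index (a real service would read and write Elasticsearch instead), files are assumed to be processed in sorted name order, and checkpoint_every is an invented parameter that limits how often the state is saved.

```python
from typing import Callable, Dict, List, Optional


class InMemoryStateStore:
    """Hypothetical stand-in for the state index; counts writes for illustration."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}
        self.writes = 0

    def load(self, index_name: str) -> Optional[str]:
        return self._docs.get(index_name)

    def save(self, index_name: str, last_file: str) -> None:
        self._docs[index_name] = last_file
        self.writes += 1


def ingest(files: List[str], index_name: str, store: InMemoryStateStore,
           index_file: Callable[[str], None], checkpoint_every: int = 100) -> int:
    """Index files in sorted order, resuming after the last checkpointed file.

    Returns the number of files indexed in this run.
    """
    ordered = sorted(files)
    last = store.load(index_name)
    # Start after the checkpointed file, or from the beginning if none is known.
    start = ordered.index(last) + 1 if last in ordered else 0
    done = 0
    for name in ordered[start:]:
        index_file(name)
        done += 1
        if done % checkpoint_every == 0:
            store.save(index_name, name)
    # Record the final position if the last file was not already checkpointed.
    if done and done % checkpoint_every != 0:
        store.save(index_name, ordered[-1])
    return done
```

With this layout, a restart repeats at most the files indexed since the last checkpoint, so the indexing step should be safe to repeat (for example by using deterministic document IDs, which is an assumption about the main index and not something stated in the thread).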

And yes, I forgot to mention that we are using ES for other use cases as well. We have 1 index for the full-text search use case, another 3 indices for metrics data, and another index for AI use cases, over and above the new index I am talking about.

If you are not frequently updating documents in Elasticsearch but rather ingesting new documents with the latest state, the situation is different and well worth trying. A high indexing load can also have an impact on the cluster, so you need to test this, but it should be less dramatic than a high-update scenario.
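
To put the peak figure from the question into numbers, here is a small worked calculation. It assumes, purely for illustration, that each API call could trigger one state write; calls_per_checkpoint is a hypothetical knob for writing state only once per batch of calls.

```python
def writes_per_second(total_calls: int, window_minutes: float,
                      calls_per_checkpoint: int = 1) -> float:
    """Average state writes per second over the window."""
    return total_calls / calls_per_checkpoint / (window_minutes * 60)


# One state write per call, 1,000,000 calls:
peak_10_min = writes_per_second(1_000_000, 10)   # about 1667 writes/s
peak_20_min = writes_per_second(1_000_000, 20)   # about 833 writes/s

# One state write per 1000 calls in the 10-minute case:
batched = writes_per_second(1_000_000, 10, 1000)  # about 1.67 writes/s
```

Whether the unbatched rate is acceptable depends on the cluster and its other workloads, which is why testing it is the only reliable answer.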

@Christian_Dahlqvist correct me if I am wrong, but isn't our scenario more about adding new state documents than about frequently updating existing ones? I am thinking of having a dedicated index for state management.
This is what I thought:
In our case, instead of modifying an existing state document, we would insert new state information whenever we process a dataset.
E.g. each time I download and index a new dataset, I would create a new entry in the state index that indicates the last file processed, rather than updating a single document repeatedly. This approach reduces the frequency of updates to a single document, since we're appending new entries, right?
Creating a dedicated index in Elasticsearch for managing state would keep the indexing logic separate from the actual raw data storage. The idea is to maintain 1 document per index name in the state_info index, so that we can query the state information for each main index without worrying about multiple state entries.
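
The two layouts mentioned in this post (appending a new state entry per dataset versus keeping one document per index name) lead to different read and write requests. The sketch below only builds the request shapes as plain Python dictionaries and sends nothing. The index name state_info and the fields indexName and timestamp come from the thread; the function names are hypothetical, and the term query assumes indexName is mapped as a keyword field.

```python
from typing import Any, Dict

STATE_INDEX = "state_info"  # name taken from the post


def latest_state_query(index_name: str) -> Dict[str, Any]:
    """Append-only layout: search body that fetches the newest state entry."""
    return {
        "size": 1,
        "query": {"term": {"indexName": index_name}},
        "sort": [{"timestamp": {"order": "desc"}}],
    }


def single_doc_target(index_name: str) -> Dict[str, str]:
    """One-document layout: use the main index name as the document ID.

    Indexing to the same ID again replaces the previous document, which
    Elasticsearch handles as an update of that document.
    """
    return {"method": "PUT", "path": f"/{STATE_INDEX}/_doc/{index_name}"}
```

The append-only layout avoids rewriting a single document but needs a sort on every read and some cleanup of old entries; the one-document layout makes reads a direct lookup by ID, at the cost of every state save being an overwrite.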