Hello Team,
I have a use case where I make an API call to a third-party system and then download the dataset to a mounted volume in the Kubernetes pod where our service runs. This data is then indexed into Elasticsearch (Elastic) for search and AI use cases. I wanted to discuss the challenge of using an emptyDir volume in Kubernetes for state persistence across pod restarts. We are not using a PVC. We know that an emptyDir volume is ephemeral and is deleted when the pod is removed, which means any state information (like the last indexed file) would be lost.
Below are some of the approaches for state persistence:
- External storage - Persist the state in an external system (like MongoDB, PostgreSQL, Redis).
- Kubernetes ConfigMaps or Secrets - This is less common for dynamic data due to size and performance considerations.
- Hybrid approach - Use emptyDir for transient data but persist the last indexed state elsewhere (like a database or external storage). This way, we get fast access to temporary files while still maintaining state across pod restarts.
- Elastic - Store the state in the same Elasticsearch cluster that already holds the indexed data.
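To make the hybrid approach concrete, here is a minimal sketch of what I have in mind. The names `StateStore`, `InMemoryStateStore`, and `index_pending_files` are made up for illustration, the in-memory store only stands in for a real external store, and the sketch assumes file names sort in download order (for example, a timestamp prefix).

```python
import os
from typing import Callable, List, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical interface for wherever the checkpoint lives (DB, Redis, Elastic)."""

    def load(self) -> Optional[str]: ...

    def save(self, last_indexed_file: str) -> None: ...


class InMemoryStateStore:
    """Stand-in for the external store, used only to keep this sketch runnable."""

    def __init__(self) -> None:
        self._last: Optional[str] = None

    def load(self) -> Optional[str]:
        return self._last

    def save(self, last_indexed_file: str) -> None:
        self._last = last_indexed_file


def index_pending_files(
    data_dir: str,
    store: StateStore,
    index_file: Callable[[str], None],
) -> List[str]:
    """Index files in name order, skipping everything up to the saved checkpoint."""
    last = store.load()
    done: List[str] = []
    for name in sorted(os.listdir(data_dir)):
        if last is not None and name <= last:
            continue  # already indexed before the restart
        index_file(os.path.join(data_dir, name))
        store.save(name)  # checkpoint only after the file has been indexed
        done.append(name)
    return done
```

The files themselves stay on the emptyDir volume and can be re-downloaded after the pod is removed; only the small checkpoint has to survive.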
I would like to use Elastic and wanted to hear your views. Using Elastic to store the state information for the indexing process would keep everything within the same data ecosystem. The idea is to create a state document that represents the state of the indexing process, with fields like indexName, lastIndexedFile, timestamp, etc.
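As a rough sketch of that state document, assuming one document per target index stored in a separate state index: the index name `indexer-state` and the helper `build_state_request` are hypothetical, and the code only builds the request for the standard `PUT /<index>/_doc/<id>` endpoint without sending it.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def build_state_request(
    index_name: str,
    last_indexed_file: str,
    state_index: str = "indexer-state",  # hypothetical name for the state index
) -> Tuple[str, str, str]:
    """Return (method, path, body) for writing the state document of one target index."""
    doc = {
        "indexName": index_name,
        "lastIndexedFile": last_indexed_file,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Fixed document id: every save overwrites the same document
    # instead of creating a new one per update.
    path = f"/{state_index}/_doc/{index_name}"
    return "PUT", path, json.dumps(doc)
```

For example, `build_state_request("products", "batch-0042.json")` targets `/indexer-state/_doc/products`. Using a fixed id keeps the state index at one small document per indexing process, although each overwrite still produces a new document version internally.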
I am a little skeptical about the high-frequency update scenario for ES. Keeping dependencies to a minimum is good, which makes Elastic the better option for state storage, but how will Elastic handle frequently changing data? Do you have any recommendation for choosing the better approach? Is there any other approach that could be followed here?
Note: we can have up to a million API calls to the third-party system in a 10-20 minute window (roughly 800 to 1,700 calls per second). This load is not consistent; that is the maximum, and there can be times with 0 calls for a whole hour.