How Elastic handles frequently changing data

Hello Team,

I have a use case where I make an API call to a 3rd-party system, then download the dataset to a mounted volume in the Kubernetes pod where our service is running. This data is then indexed into Elastic for search and AI use cases. I wanted to discuss the challenge of using an emptyDir volume in Kubernetes for state persistence across pod restarts. We are not using a PVC. We know that an emptyDir volume is ephemeral and will be deleted when the pod is removed, which means any state information (like the last indexed file) would be lost.

Below are some of the approaches for State Persistence:

  • Use an external storage solution to persist the state (like MongoDB, PostgreSQL, Redis)
  • Use Kubernetes ConfigMaps or Secrets - this is less common for dynamic data due to size and performance considerations.
  • Hybrid Approach - Use emptyDir for transient data but persist the last indexed state elsewhere (like a database or external storage). This way, we can have fast access to temporary files while still maintaining state across pod restarts.
  • Elastic - store the state in Elasticsearch itself.

I want to use Elastic and wanted to hear your views/thoughts. Using Elastic to store the state information for the indexing process would allow us to keep everything within the same data ecosystem. The idea is to create a state document structure that represents the state of the indexing process, with fields like indexName, lastIndexedFile, timestamp, etc.
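To make the idea concrete, here is a minimal sketch of such a state document. The field names (indexName, lastIndexedFile, docsIndexed) and the index/job names in the comment are assumptions for illustration, not a fixed schema:

```python
from datetime import datetime, timezone

def build_state_doc(index_name, last_indexed_file, docs_indexed):
    """Build a checkpoint document to store in a dedicated state index."""
    return {
        "indexName": index_name,
        "lastIndexedFile": last_indexed_file,
        "docsIndexed": docs_indexed,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# With the official Python client, this could be written with a fixed _id so
# each run overwrites the previous checkpoint instead of accumulating documents:
#   es.index(index="ingest-state", id="my-ingest-job", document=build_state_doc(...))
```

Using a fixed document _id keeps the state index at one document per ingest job, at the cost of turning every checkpoint into an update rather than an insert.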
I am a little skeptical about the high-frequency update scenario for ES. Keeping dependencies to a minimum is good, so Elastic is the better option for state storage for that reason, but how will Elastic handle frequently changing data? Any recommendation on choosing the better approach? Is there any other approach that could be followed here?
Note: we can have a million API calls to the 3rd-party system in a 10-20 minute window. This is not consistent; it is the maximum, and there could be times when there are 0 calls even in an hour.

Elasticsearch is IMHO neither designed nor optimised for use cases with frequent updates, so I would test it and see how it works before committing. It is likely to add a lot of overhead, so if you are using the Elasticsearch cluster for other things, those may very well be affected.

Thanks @Christian_Dahlqvist for the reply. The data in the actual index is essentially immutable; the updates I am referring to are for a separate index and document used for state management, to handle the pod-restart issue and give our service a mechanism to resume indexing JSON files from the last successfully indexed document without using a PVC, just a normal mounted volume. On the happy path, without pod restarts, we don't need to go back and forth to ES to get or save state information; however, if the pod restarts in the middle of the ingestion process, we want to resume from the last processed doc instead of starting all over again.
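The resume-on-restart mechanism described above could look roughly like this. The function and field names are assumptions; it assumes the downloaded JSON files are processed in sorted order:

```python
def files_to_process(all_files, state_doc):
    """Given the list of downloaded JSON files and the last saved checkpoint
    (or None on a clean start), return the files that still need indexing.
    Files are assumed to be processed in sorted order."""
    ordered = sorted(all_files)
    if state_doc is None:
        return ordered
    last = state_doc.get("lastIndexedFile")
    if last not in ordered:
        # Checkpoint points at a file we no longer have (e.g. the emptyDir
        # volume was wiped on pod restart), so re-download / reprocess all.
        return ordered
    return ordered[ordered.index(last) + 1:]
```

On the happy path the checkpoint is only written, never read; it is read once, at startup, to decide where to resume.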

And yes, I forgot to mention we are using ES for other use cases as well: one index for a full-text search use case, another 3 indices for some kind of metrics data, and another index for AI use cases, on top of the new index I am talking about.

If you are not frequently updating the documents in Elasticsearch but rather ingesting new documents with the latest state, the situation is naturally different and well worth trying. A high indexing load can naturally also have an impact on the cluster, so you need to test this, but it should be less dramatic than a high-update scenario.
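The append-only pattern suggested here means each checkpoint becomes a new document, and on restart the service fetches only the newest one. A minimal sketch, assuming each state document carries an ISO-8601 timestamp as above (index and field names are illustrative):

```python
def latest_checkpoint(checkpoints):
    """Pick the newest checkpoint from a list of state documents, e.g. the
    hits returned by a sort-by-timestamp-descending query. ISO-8601 UTC
    timestamps compare correctly as strings."""
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda d: d["timestamp"])

# With the official Python client, the equivalent server-side query would be
# roughly (letting Elasticsearch do the sorting and return a single hit):
#   es.search(index="ingest-state", sort="timestamp:desc", size=1)
```

Old checkpoint documents can then be cleaned up periodically, for example with an ILM policy or a delete-by-query, rather than being updated in place.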