Data Loss in Elasticsearch

Hi,

I have a real-time streaming application that updates an Elasticsearch index in real time.
The issue we are currently facing is data loss in Elasticsearch; the example scenario below explains what happened.

Assume we have a document with the following content:
Client1

-- Name1
-- Name2
-- Name3

We receive an insert of Name4 for Client1, so our code fetches the document from ES, constructs new JSON with Name4 added, and writes it back to Elasticsearch.
The expected result is:
Client1

-- Name1
-- Name2
-- Name3
-- Name4

Our code builds the above JSON correctly, and our application log shows a success response confirming that the data was written to Elasticsearch.
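
Roughly, the flow looks like the sketch below (Python client; the index, type, field names, and query are illustrative stand-ins, not our actual code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Fetch the current document for the client via a search
# (index, type, and field names are made up for illustration).
resp = es.search(index="clients", body={
    "query": {"term": {"client_id": "Client1"}}
})
doc = resp["hits"]["hits"][0]["_source"]

# Rebuild the JSON with the new name appended ...
doc["names"].append("Name4")

# ... and write the whole document back under the same id.
es.index(index="clients", doc_type="client", id="Client1", body=doc)
```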

However, during this time we could see a GC overhead warning message in the Elasticsearch cluster logs.

Immediately afterwards we receive a delete of Name2, but when our code fetches the data from the ES index, we get the following:
Client1

-- Name1
-- Name2
-- Name3

Ideally, we should get:
Client1

-- Name1
-- Name2
-- Name3
-- Name4

Also, please note that there is a one-minute gap between the first and second requests.

This GC overhead is causing the data loss.
Can someone shed more light on this?

It's not; it is only related to the data loss by timing.

Are you using unique _ids when you are inserting and updating? Can you share your code?

Hi Mark,

We add a one-second delay between inserts to ensure the data is properly indexed and replicated.
Yes, we are using unique ids specific to our application for this.
Can you please shed more light on this?

Are you retrieving the document through a search or a get by Id? Can you show how you perform the update and how you get the resulting document? How many nodes do you have in the cluster? Which version are you using?

We are retrieving the document by search; we have 5 nodes in the cluster, and the version is 5.5.
The update is performed using a document id that we set ourselves.

Documents are not immediately made available for search. Only after a periodic refresh takes place are documents published and made available for search, although you can always get them by id. The reason for this is performance, as a refresh is an expensive operation. You can force a refresh to happen immediately after an indexing request, but this can severely affect performance if you index or update frequently. You can also choose to hold off on getting a response until a refresh has happened (by default once per second, driven by the index refresh interval), but this will make your index operations take longer.
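
As a sketch of those two options with the Python client (the refresh parameter maps to the documented ?refresh=wait_for option; the index, type, id, and document values here are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
doc = {"client_id": "Client1", "names": ["Name1", "Name2", "Name3", "Name4"]}

# Option 1: block the index call until the next refresh has made the
# document visible to search (the ?refresh=wait_for REST parameter).
es.index(index="clients", doc_type="client", id="Client1",
         body=doc, refresh="wait_for")

# Option 2: keep indexing fast and read the document back with a
# realtime get by id, which does not depend on a refresh at all.
fetched = es.get(index="clients", doc_type="client", id="Client1")["_source"]
```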

It is a reasonably common question, so you should be able to find it discussed in other threads.

Sure. I can increase the wait time in our application so that it waits for 2 seconds until the data is refreshed in the index. But I have one more query: I can see the following message in the ES logs.
[2017-04-07T13:34:15,260][WARN ][o.e.m.j.JvmGcMonitorService] [FBp7aLX] [gc][5157] overhead, spent [1s] collecting in the last [1.1s]

Does this mean a data refresh takes the refresh interval (1 s by default) + 1.1 s (from the log above), which sums to 2.1 s?
If this is the case, I may need to increase the wait time in my application between ES reads to more than the above.
Please confirm.
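
For reference, instead of a fixed sleep this is roughly what I have in mind: poll until the document is visible to search, with an upper bound (Python client; the index name and id are illustrative):

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Instead of trusting a fixed sleep, poll until the freshly written
# document becomes visible to search, giving up after ~5 seconds.
for _ in range(10):
    resp = es.search(index="clients", body={
        "query": {"ids": {"values": ["Client1"]}}
    })
    if resp["hits"]["total"] > 0:
        break
    time.sleep(0.5)
```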
