Elasticsearch lost document updates

Hi,

We are running an ES cluster on version 5.6.4 with Java 8. We use scripted_upsert to merge multiple events into the same ES document. Each node holds ~250 shards and the average shard size is 6 GB. We index around 20-30k events/sec using the bulk API.
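For context, a single merge update sent through the bulk API looks roughly like this (a simplified sketch via the Python client; the index, type, id, field names and the Painless script are placeholders, not our real mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# One bulk item: an action/metadata line followed by the update body.
# scripted_upsert runs the script even when the document does not exist yet,
# so the first event creates the document and later events are merged into it.
bulk_body = [
    {"update": {"_index": "index", "_type": "type", "_id": "id", "retry_on_conflict": 3}},
    {
        "scripted_upsert": True,
        "script": {
            "source": "ctx._source.events.addAll(params.events)",  # placeholder merge script
            "params": {"events": [{"field": "value"}]},
        },
        "upsert": {"events": []},  # initial document when the id is new
    },
]

response = es.bulk(body=bulk_body)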
We noticed lately (it was probably happening before) that when an ES node is restarted we lose some events. Since we merge multiple events into a single ES document, we can see that parts of the document are missing.
While troubleshooting, one thing we noticed is that the bulk response for one of these specific (missing) events looked like this:

{
  "took": 668,
  "errors": false,
...
{
  "update": {
    "_index": "index",
    "_type": "type",
    "_id": "id",
    "_version": 2,
    "result": "updated",
    "_shards": {
      "total": 2,
      "successful": 1,
      "failed": 1,
      "failures": [
        {
          "_index": "index",
          "_shard": 30,
          "_node": "xfiXwVpVS6uZeLLHvKdqwg",
          "reason": {
            "type": "node_not_connected_exception",
            "reason": "[XXX][XX.XX.XX.XX:9300] Node not connected"
          },
          "status": "INTERNAL_SERVER_ERROR",
          "primary": false
        }
      ]
    },
    "status": 200
  }
},

Since result is "updated" we acknowledge this event, but in reality this document was not updated. Should we additionally check that all shards are successful or is this a possible bug?

Kr

The way that documents are replicated in versions prior to 6.0 is quite different from how things work today. Here's a blog post about the changes: https://www.elastic.co/blog/elasticsearch-sequence-ids-6-0. It's quite possible that what you're seeing is fixed by the changes discussed here.

Very recently we also found and fixed a subtle issue related to lost updates when a node restarts: https://github.com/elastic/elasticsearch/pull/39584. It should be released in version 6.6.3. It would be interesting to hear whether you can reproduce this on 6.6.3 or later.

Yeah, this could be the issue.

So, if I understood correctly, a quick fix for this would be to check every document response for failed shards (something like the sketch below)? If a replica is unavailable I'll get "total": 2, "successful": 1, "failed": 0, or in the case of a green index "total": 2, "successful": 2, "failed": 0.
I'll try to reproduce it again on version 5.6.4 and then once more on 6.6.3.
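Roughly what I have in mind (a rough sketch; bulk_body is the same kind of scripted_upsert bulk request as in my first post, and I would only acknowledge items with no failed shard copies):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

def failed_items(bulk_response):
    # Yield bulk items where at least one shard copy reported a failure,
    # even though the top-level "errors" flag may still be false.
    for item in bulk_response["items"]:
        result = item.get("update") or item.get("index") or {}
        if result.get("_shards", {}).get("failed", 0) > 0:
            yield result

response = es.bulk(body=bulk_body)  # bulk_body built as in the first post
for item in failed_items(response):
    # Do not acknowledge these events; re-queue them instead.
    print("shard failure for %s: %s" % (item["_id"], item["_shards"]["failures"]))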

One question that is still bothering me: why is "errors": false at the root of the bulk response in this case? Shouldn't it be "errors": true?

I think this means that Elasticsearch accepted every item in your bulk request. If it wasn't hitting a bug then errors: false would mean that every in-sync shard copy has every update: either the copy accepted the update or it was removed from the in-sync set. If a copy is removed from the in-sync set then it has to go through a recovery process to catch up with whatever it missed.

Are you only hitting this issue when restarting nodes, or is it happening more generally? How often are you restarting a node?

I don't think we have experienced the issue when all nodes are working correctly, but it may be that we just haven't found the updates that failed.
When I first tried to reproduce the issue it took me only 3 restarts; the next time it took 25 restarts.
Can I somehow track whether a replica is in-sync or out-of-sync? Maybe some logger that I could turn on?
Thank you for your help.

The short answer is that (if it isn't hitting a bug) a shard is marked as in-sync during recovery and marked as out-of-sync when it fails to acknowledge an indexing operation, so if its state is STARTED (or POST_RECOVERY) then it's in-sync. There's a bunch more detail about the in-sync set in this blog post.
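If you want to watch this during your tests, something like the following should show both the per-copy shard state and the in-sync allocation ids (a rough sketch using the Python client; "test" is just an example index name):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Per-copy routing state: copies in STARTED (or POST_RECOVERY) are in-sync.
for shard in es.cat.shards(index="test", format="json"):
    print(shard["index"], shard["shard"], shard["prirep"], shard["state"], shard["node"])

# The in-sync allocation ids are kept in the cluster state metadata,
# keyed by shard number.
state = es.cluster.state(metric="metadata", index="test")
print(state["metadata"]["indices"]["test"]["in_sync_allocations"])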

Hi,

I created tests to reproduce it on ES 5.6.4 and 6.6.2. Since 6.6.3 is not yet available as a Docker image, I wasn't able to test that version.
I started 3 ES nodes locally using this docker-compose and started indexing. At the same time I ran a Python script that checked which node holds the primary shard of the test index and restarted that node.
I managed to reproduce it on 6.6.2 as well; the results are here.
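The relevant part of the restart script looks roughly like this (a simplified sketch; I'm assuming the container names match the ES node names, and "test" is the test index):

import subprocess
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Find the node currently holding the primary of the test index and restart
# its container, forcing a replica to be promoted.
for shard in es.cat.shards(index="test", format="json"):
    if shard["prirep"] == "p" and shard["state"] == "STARTED":
        subprocess.run(["docker", "restart", shard["node"]], check=True)
        break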

When ES 6.6.3 is released I can try to reproduce it in the same way.

Kr
