Elasticsearch lost document updates


(Matija Bruncic) #1

Hi,

We are running ES cluster using ES version 5.6.4 and Java 8. We use scripted_upsert to merge multiple events in same ES document. Each node has ~250 shards and average shard size is 6GB. Indexing is around 20-30k/sec and we use bulk API.
We noticed lately(probably was happening before) that when ES node is restarted we lose some events. Since we are merging multiple events in single ES document we noticed that some parts of document are missing.
After troubleshooting one thing that we noticed is that response for this specific (missing) event had response like this:

{
  "took": 668,
  "errors": false,
...
{
  "update": {
    "_index": "index",
    "_type": "type",
    "_id": "id",
    "_version": 2,
    "result": "updated",
    "_shards": {
      "total": 2,
      "successful": 1,
      "failed": 1,
      "failures": [
        {
          "_index": "index",
          "_shard": 30,
          "_node": "xfiXwVpVS6uZeLLHvKdqwg",
          "reason": {
            "type": "node_not_connected_exception",
            "reason": "[XXX][XX.XX.XX.XX:9300] Node not connected"
          },
          "status": "INTERNAL_SERVER_ERROR",
          "primary": false
        }
      ]
    },
    "status": 200
  }
},

Since result is "updated" we acknowledge this event, but in reality this document was not updated. Should we additionally check that all shards are successful or is this a possible bug?

Kr


(David Turner) #2

The way that documents are replicated in versions prior to 6.0 is quite different from how things work today. Here's a blog post about the changes: https://www.elastic.co/blog/elasticsearch-sequence-ids-6-0. It's quite possible that what you're seeing is fixed by the changes discussed here.

We also found and fixed a subtle issue related to a lost update when a node restarts very recently: https://github.com/elastic/elasticsearch/pull/39584. It should be released in version 6.6.3. It would be interesting to hear if you can reproduce this in 6.6.3 or later.


(Matija Bruncic) #3

Yeah, this could be the issue.

So, if I understood correctly, quick fix for this would be to check every document if some shards have failed? If replica is unavailable I'll get "total": 2, "successful": 1, "failed": 0 or in the case of green index "total": 2, "successful": 2, "failed": 0.
I'll try to reproduce it again on 5.6.4 version and then once again on 6.6.3 version.

One question that is still bothering me is why is "errors": false in root of bulk response in this case, shouldn't it be "errors": true?


(David Turner) #4

I think this means that Elasticsearch accepted every item in your bulk request. If it wasn't hitting a bug then errors: false would mean that every in-sync shard copy has every update: either the copy accepted the update or it was removed from the in-sync set. If a copy is removed from the in-sync set then it has to go through a recovery process to catch up with whatever it missed.

Are you only hitting this issue when restarting nodes, or is it happening more generally? How often are you restarting a node?


(Matija Bruncic) #5

I don't think that we have experienced the issue when all nodes are working correctly, however it may be that we just haven't found updates that have failed.
When I first tried to reproduce the issue it took me only 3 restarts, next time when I tried to reproduce the issue it took 25 restarts.
Can I somehow track whether some replica is in-sync / out-of-sync? Maybe some logger that I could turn on?
Thank you for your help


(David Turner) #6

The short answer is that (if it wasn't hitting a bug) then a shard is marked as in-sync during recovery, and is marked as out-of-sync when it fails to acknowledge an indexing operation, so if its state is STARTED (or POST_RECOVERY) then it's in-sync. There's a bunch more detail about the in-sync set in this blog post.


(Matija Bruncic) #7

Hi,

I created a tests to reproduce it on ES 5.6.4 and 6.6.2. Since 6.6.3 is still unavailable as docker image I wasn't able to test it.
I started 3 ES nodes locally using this docker-compose and started indexing. At the same time I started python script that was checking which node holds primary shard for test index and restart that node.
I managed to reproduce it in 6.6.2 also, results are here.

When ES 6.6.3 is released I can try to reproduce it same way.

Kr