Why is this delete operation reported as success if the document is still there after refresh?

I'm deleting a bunch of documents in Elasticsearch using helpers.streaming_bulk from the Python library

for ok, results in helpers.streaming_bulk(
  _es_session.es_client,
  operations,
  max_retries=20,
  raise_on_error=False,
  refresh=True,
):
  action, result = results.popitem()

  logging.warning(
    {"message": "Delete operation", "details": result, "_id": result["_id"]}
  )

In this case operations is a list of deletes that look like

{
  "_index": "xxx",
  "_id": msg_id,
  "_op_type": "delete",
  "_routing": routing,
}
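
For reference, the loop above never looks at the ok flag. Here is a minimal sketch of how the (ok, item) pairs yielded by helpers.streaming_bulk could be split into successes and failures before logging. The helper name split_bulk_results is hypothetical, and the sample pairs are hand-written stand-ins, not real cluster output:

```python
def split_bulk_results(pairs):
    """Split (ok, item) pairs, shaped like those yielded by streaming_bulk."""
    succeeded, failed = [], []
    for ok, item in pairs:
        # Each item is a one-key dict such as {"delete": {...response...}}
        ((action, result),) = item.items()
        record = {
            "action": action,
            "_id": result.get("_id"),
            "result": result.get("result"),
            "status": result.get("status"),
        }
        (succeeded if ok else failed).append(record)
    return succeeded, failed


# Hand-written sample pairs (assumed shape, for illustration only)
sample = [
    (True, {"delete": {"_id": "a", "result": "deleted", "status": 200}}),
    (False, {"delete": {"_id": "b", "result": "not_found", "status": 404}}),
]
succeeded, failed = split_bulk_results(sample)
```

Logging the failed list separately would make it easier to spot deletes that did not find a document, as opposed to deletes that succeeded and were later undone by another write.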

I'm logging the result of every delete operation and I just found one that looks like this

{
  "message": "Delete operation",
  "details": {
    "_index": "xxx",
    "_type": "_doc",
    "_id": "5e66366e42c85a0701986c1e",
    "_version": 416147310420656688,
    "result": "deleted",
    "forced_refresh": true,
    "_shards": {"total": 2, "successful": 2, "failed": 0},
    "_seq_no": 4930576,
    "_primary_term": 1,
    "status": 200
  },
  "_id": "5e66366e42c85a0701986c1e",
  "Module": "root",
  "Severity": "WARNING"
}

As you can see, the status is 200, but when I run this query

GET /enriched_reviews_v9/_search
{
  "query": {
    "ids": {"values": ["5e66366e42c85a0701986c1e"]}
  },
  "_source": false
}

it returns the document.

Am I missing something? Why is this delete operation reported as a success if the document is still there (I'm forcing a refresh)?

Are you using routing? If so, did you supply the same routing key when deleting as indexing? What happens if you query using the same routing parameter?
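
One way to check this outside of _search is a routed lookup by id. A minimal sketch, assuming the Python client's exists() call with index, id and routing arguments; document_exists is a hypothetical helper, and FakeClient is only a stand-in so the snippet runs without a cluster (a real cluster may still find a document under a different routing value if both values hash to the same shard):

```python
def document_exists(client, index, doc_id, routing):
    """Return True if the document is found by a routed lookup on its id."""
    # On a real client this issues a HEAD request for the document,
    # passing the routing value so the right shard is consulted.
    return bool(client.exists(index=index, id=doc_id, routing=routing))


class FakeClient:
    """Simplified stand-in for elasticsearch.Elasticsearch (illustration only)."""

    def __init__(self, docs):
        self.docs = docs  # set of (index, id, routing) tuples

    def exists(self, index, id, routing=None):
        return (index, id, routing) in self.docs


client = FakeClient({("xxx", "doc-1", "route-1")})
found = document_exists(client, "xxx", "doc-1", "route-1")
missing = document_exists(client, "xxx", "doc-1", "route-2")
```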

Yes, I'm using a routing key and I'm providing it correctly. I set up my mapping using

"_routing": {
  "required": true
}

so it raises an exception if I don't do it. I am thinking about a race condition, but I can't explain why. Maybe if I provide more details you will have a better idea of what is happening.

I have an army of robots (2 to 20) listening to a stream of messages containing _id. Each robot searches with a batch of _id values to get the corresponding _routing keys, so it can use both _id and _routing to create delete actions in a bulk operation. I'm adding refresh=true to the bulk call so the change is immediately available to the other robots.
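
The lookup-then-delete step described here can be sketched as follows. This is only an illustration: build_delete_actions is a hypothetical name, and the hits are hand-written in the shape of search hits for documents indexed with routing (which carry a _routing field):

```python
def build_delete_actions(hits):
    """Turn search hits into bulk delete actions, carrying over _routing."""
    actions = []
    seen = set()
    for hit in hits:
        key = (hit["_index"], hit["_id"], hit["_routing"])
        if key in seen:
            # Skip duplicates within one batch (same id received twice)
            continue
        seen.add(key)
        actions.append({
            "_op_type": "delete",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "_routing": hit["_routing"],
        })
    return actions


# Hand-written hits (assumed shape, for illustration only)
sample_hits = [
    {"_index": "xxx", "_id": "id-1", "_routing": "r-1"},
    {"_index": "xxx", "_id": "id-2", "_routing": "r-2"},
    {"_index": "xxx", "_id": "id-1", "_routing": "r-1"},
]
actions = build_delete_actions(sample_hits)
```

Note that deduplicating inside one batch does nothing for two robots that receive the same _id in separate batches at about the same time.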

It seems that this error happens when two robots receive the same _id around the same time and try to perform the steps described above.

Does that help a bit?

Routing ensures that all documents with the same key end up in the same shard. It does however not mean that documents associated with two different routing keys will end up in different shards. If you have documents with the same id but different routing keys, they can therefore affect each other.

A routing key is not a namespace, and it is possible to have many more routing keys than you have shards.
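
To illustrate the point, here is a simplified sketch of routing as "hash the key, take it modulo the number of shards". The hash used here (zlib.crc32) is an assumption chosen for the demo; Elasticsearch uses its own hash function and a slightly more involved formula. The shard count of 2 and the key names are made up:

```python
import zlib


def shard_for(routing, num_shards):
    """Map a routing key to a shard number (simplified stand-in formula)."""
    # Stand-in hash for the demo, not the hash Elasticsearch actually uses
    return zlib.crc32(routing.encode("utf-8")) % num_shards


num_shards = 2
routing_keys = [f"key-{i}" for i in range(10)]

# Group routing keys by the shard they land on
placement = {}
for key in routing_keys:
    placement.setdefault(shard_for(key, num_shards), []).append(key)
```

With 10 keys and 2 shards, at least one shard necessarily holds several different routing keys, whatever hash is used.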

Could this explain what you are seeing?

I don't think so, routing keys are unique for an id. I may have found something else, related to _version. I can see in the logs that this document was successfully deleted

{
  "message": "Delete operation", 
  "details": {
    "_index": "xxx",
    "_type": "_doc",
    "_id": "5ea608578cf0ecd7dd2d165f",
    "_version": 420329815420844587,
    "result": "deleted",
    "forced_refresh": true,
    "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
    }, 
    "_seq_no": 4821083,
    "_primary_term": 1,
    "status": 200
  }
}

but it is still in the index when I query it using

GET /xxx/_search?routing=5bf52d5b9984980001acdc04
{
  "query": {
    "ids": {"values": ["5ea608578cf0ecd7dd2d165f"]}
  },
  "version": true
}

But the version I have in my index is 420329815420844603 and the version that was successfully deleted was 420329815420844587.
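
A small sketch of that comparison, using the two values from the logs above. It assumes that versions for a given _id only move forward, which may not hold in every setup (for example with externally supplied versions once the delete tombstone has expired), so treat the result as a hint rather than proof. written_after_delete is a hypothetical helper name:

```python
deleted_version = 420329815420844587  # _version in the delete response
indexed_version = 420329815420844603  # _version returned by the search


def written_after_delete(deleted_version, indexed_version):
    """True if the indexed copy carries a higher version than the delete."""
    # Python ints have arbitrary precision, so these large values compare exactly
    return indexed_version > deleted_version


delta = indexed_version - deleted_version
reindexed = written_after_delete(deleted_version, indexed_version)
```

A higher version on the indexed copy would be consistent with the document being written again after the delete went through.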

I'm adding more logs to see if someone is inserting the document again in between. I'll let you know if I find something, but it looks like the problem is not a race condition. Thanks a lot for your help

It looks like you are not searching with routing key specified so it is possible the document you are finding is not in the ‘correct’ shard? Try searching with the appropriate key.

Oops, sorry, I forgot to include routing in my previous message (just edited it), but I am using routing correctly and still have the same problem.

Can you create a minimal script that reproduces the problem? That will help us see in detail the steps involved and the parameters used.