DocId bit flip after routing [Document retrievable by _search, but doesn't return in get api]

Hi,
I have a document that was indexed a month ago in Elasticsearch 6.3.1. If I search by ID or by any other term in that doc, the search returns it. However, the GET API responds with found: false, indicating that the doc is not present. Why would this happen, and how can I debug it?

GET /_search
{ "query" : {
"match" : {
"_id": "&36866960708-282941204767833454"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 64,
"successful" : 11,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
...
"_id" : "&36866960708-282941204767833454",
...

GET _doc/&36866960708-282941204767833454?pretty
    {
    ...
    "_id" : "&36866960708-282941204767833454",
    "found" : false
    }

I cannot reproduce this, because this GET command does not respond as you say at all:

GET /_doc/&36866960708-282941204767833454?pretty

# 405 Method Not Allowed
# {
#   "status": 405,
#   "error": "Incorrect HTTP method for uri [/_doc/&36866960708-282941204767833454?pretty] and method [GET], allowed: [POST]"
# }

Does the ampersand need escaping to make a valid URL?

It technically does, although that's not the problem. The issue is that _doc isn't a valid index name.

Sorry, I redacted the URL incorrectly. _doc is the doc type. The full URL is something like the one below:

curl -H 'Content-Type:application/json' "localhost:9200/$INDEX/_doc/&36866960708-282941204767833454"

I also tried with URL escaping; the result is the same.

Ok, my best guess is that you're trying to get the document from a different index from the one it's in (given that you aren't sharing the index names). Another possibility is that you've deleted this document but not refreshed this index yet.

No, the index name is the same. And the doc was indexed 30+ days ago, which rules out the refresh concern.

One more piece of information: the developer who indexed that doc claims that the character he sent in the doc ID was '6' rather than '&', which suggests a single bit flip somewhere in the index. There are around 10 docs where one bit seems to have flipped like this, and all of them have the same issue (searchable, but not retrievable via the GET API). Have similar issues been seen before?
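
The single-bit-flip claim can be sanity-checked by XOR-ing the two characters. This is only an illustrative sketch; the class and method names are made up for the example and do not come from Elasticsearch.

```java
public class BitFlipCheck {
    // Number of bit positions in which the two characters differ.
    static int differingBits(char a, char b) {
        return Integer.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        char sent = '6';   // what the developer says was sent
        char stored = '&'; // what the index holds now
        // prints: '6' = 0x36, '&' = 0x26, xor = 0x10, differing bits = 1
        System.out.printf("'%c' = 0x%02X, '%c' = 0x%02X, xor = 0x%02X, differing bits = %d%n",
                sent, (int) sent, stored, (int) stored, sent ^ stored,
                differingBits(sent, stored));
    }
}
```

The two ASCII codes differ only in bit 4 (0x10), so a single flipped bit is at least consistent with what the developer reports.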

By the way, I am not able to reproduce this on a new index with that & character in the doc ID.

curl -XPOST -H 'Content-Type:application/json' "localhost:9200/index/_doc/&36866960708-282941204767833454" -d '{"first": "last"}'

{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

curl localhost:9200/index/_search
{"took":125,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_score":1.0,"_source":{"first": "last"}}]}}

curl -H 'Content-Type:application/json' "localhost:9200/index/_doc/&36866960708-282941204767833454"
{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_version":1,"found":true,"_source":{"first": "last"}}

Ah ok. A bit flip is very surprising. However, if we consider that as a possibility then that document might well be in the wrong shard for its ID, so GET won't be able to find it. If you use the ?routing=XXXX parameter for sufficiently many values of XXXX then do you find it?
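
One way to script this is to build the GET URL with an explicit routing value and a percent-encoded ID. Below is a minimal sketch using only the JDK. The host, the index name "my-index", and the class and method names are placeholders, and URLEncoder performs form-style encoding, which is adequate for these IDs but is not a general-purpose path encoder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RoutedGetUrl {
    // Builds <base>/<index>/_doc/<id>?routing=<routing> with the ID and routing escaped.
    static String build(String base, String index, String id, String routing) {
        String encodedId = URLEncoder.encode(id, StandardCharsets.UTF_8);
        String encodedRouting = URLEncoder.encode(routing, StandardCharsets.UTF_8);
        return base + "/" + index + "/_doc/" + encodedId + "?routing=" + encodedRouting;
    }

    public static void main(String[] args) {
        String id = "&36866960708-282941204767833454";
        // Candidate routing values to try; extend this list as needed.
        String[] candidates = {id, "636866960708-282941204767833454"};
        for (String routing : candidates) {
            // Print the URL; pass it to curl or an HTTP client of your choice.
            System.out.println(build("http://localhost:9200", "my-index", id, routing));
        }
    }
}
```

Each printed URL can be requested in turn; a response with "found": true identifies a routing value that lands on the shard holding the doc.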

I know the particular shard that holds this document. I found it by searching with the preference parameter and shard numbers. If the doc went to that shard for indexing, it is unlikely to be present in another shard now, right?

How does the GET API work? Since search is able to fetch the _source from StoredFields, I believe GET should be able to reach it as well. How does GET differ from the fetch that happens in search?

_search searches all the shards in the index, whereas a plain GET just queries the one shard according to the given document ID (or the routing parameter if given).
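
To make the difference concrete: a GET picks its shard by hashing the routing value (the document ID by default) and taking the result modulo the number of shards. The sketch below approximates this with a standard Murmur3 (x86, 32-bit, seed 0) hash over the UTF-16 little-endian bytes of the string, which is my understanding of what 6.x does. It deliberately ignores routing_num_shards, the routing factor, and partitioned routing, so the shard numbers it prints are not guaranteed to match a real cluster. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class RoutingSketch {
    // Standard MurmurHash3 x86 32-bit over a byte array.
    static int murmur3(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = data.length & ~3; // end of the last full 4-byte block
        for (int i = 0; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
            h1 ^= k1; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
        }
        int k1 = 0;
        switch (data.length & 3) { // 1-3 trailing bytes; fall-through is intentional
            case 3: k1 ^= (data[roundedEnd + 2] & 0xff) << 16;
            case 2: k1 ^= (data[roundedEnd + 1] & 0xff) << 8;
            case 1: k1 ^= (data[roundedEnd] & 0xff);
                k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
                h1 ^= k1;
        }
        h1 ^= data.length; // finalization mix
        h1 ^= h1 >>> 16; h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13; h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Simplified routing: hash(routing) mod number of shards (assumption, see lead-in).
    static int shardFor(String routing, int numShards) {
        byte[] bytes = routing.getBytes(StandardCharsets.UTF_16LE);
        return Math.floorMod(murmur3(bytes, 0), numShards);
    }

    public static void main(String[] args) {
        int numShards = 64; // shard count mentioned in this thread
        String[] ids = {"&36866960708-282941204767833454", "636866960708-282941204767833454"};
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
    }
}
```

The point is only that the target shard is a pure function of the routing string: if the stored ID differs from the value used for routing at index time, a plain GET by ID will generally look in a different shard from the one that holds the doc.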

When I do _search?preference=_only_local, it only searches the local shard. I first arrived at the shard ID with preference=_shards:x, then went to the host holding shard x and used _only_local to confirm.

Now, I wrote a small piece of Lucene code to run the same query directly on that shard's index directory, and interestingly I don't get that doc. That somewhat validates your theory that Elasticsearch has not refreshed.

I can force a refresh, but I am worried that if it solves the problem I will lose the ability to debug it and may not be able to reproduce the issue. There are two questions I am after: 1) why the bit flipped (I will look for proof with the developer who fed it in, to make sure the issue is not in the client), and 2) why a refresh did not happen for so long. We run a delete_by_query every 4 hours that deletes all docs older than 30 days, and those docs have been getting deleted, which indicates that refresh is actually happening.

Hmm, I had a bug in this test code. I actually do see this result from Lucene itself in a new index reader, which rules out any issue with refresh.

The code below returns the doc.

    // Assumes `searcher` is an IndexSearcher opened on the shard's index directory.
    QueryParser qp = new QueryParser("accountId", new StandardAnalyzer());
    Query query1 = qp.parse("636866960708");
    QueryParser qp2 = new QueryParser("metricId", new StandardAnalyzer());
    Query query2 = qp2.parse("282941204767833454");
    // Both clauses are required (MUST), so a hit has to match accountId and metricId.
    BooleanQuery booleanQuery = new BooleanQuery.Builder()
        .add(query1, BooleanClause.Occur.MUST)
        .add(query2, BooleanClause.Occur.MUST)
        .build();
    TopDocs hits = searcher.search(booleanQuery, 10);
    System.out.println(hits.totalHits);

Hmm, this is getting interesting. I checked the code path for GET, and even VersionsAndSeqNoResolver resolves the version for that particular doc ID correctly. So why is the Elasticsearch GET returning found: false?

    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexReader reader = DirectoryReader.open(dir);
    System.out.println(VersionsAndSeqNoResolver.loadDocIdAndVersion(reader,
        new Term(IdFieldMapper.NAME, Uid.encodeId("&36866960708-282941204767833454"))).version);

I don't see the code doing anything other than a version conflict check after getting the version.

I think it's because it's not running on the shard that you're currently looking at. In 6.3.1 the shard or shards are calculated here:

Thanks David, I will explore that direction. I was not very convinced earlier, since a bit flip looked very unlikely :). Essentially, what we are saying is that doc ID "&36866960708-282941204767833454" went to shard X initially, but routing now sends it to some other shard. For that to happen, the doc ID had to change after the initial routing!

Let me confirm which shard it is routed to now and get back!

The index has 64 shards. A _search with preference=_shards:34 returns both docs (none of the other shards has them):
"_id" : "&36866960708-282941204767833454",
"_id" : "636866960708-282941204767833454",

However, the OperationRouting class gives the following mapping for an index with 64 shards:
"_id" : "&36866960708-282941204767833454" -> shardId:30
"_id" : "636866960708-282941204767833454" -> shardId:62

That would indeed explain why the GET API doesn't find it.

The fact that neither of these docs is in the shard it should be in according to its ID is informative. When indexing this document, was the routing parameter set?

AFAIK, routing was not used. But there is no easy way for me to confirm this, so I am following up with the developer who ingested that doc.

I also tried the _search_shards API, and I am wondering why we see different shard numbers in the cluster compared to the code (I am using the 6.3.1 code). Am I doing something wrong?

/_search_shards?routing=%2636866960708-282941204767833454&pretty" | grep shard

"shards" : [
"shard" : 32,
"shard" : 32,
"shard" : 32,

/_search_shards?routing=636866960708-282941204767833454&pretty" | grep shard
"shards" : [
"shard" : 34,
"shard" : 34,
"shard" : 34,

Those API calls look reasonable. Without looking at the code you're using to simulate this it's hard to say where the problem is.

Given that GET /_search_shards?routing=636866960708-282941204767833454 points us at shard 34, which is the shard in which these docs live, I wonder if these docs were indexed with routing=636866960708-282941204767833454 and an incorrect ID, or whether the strange error replacing the 6 with the & was within Elasticsearch after it had computed the routing. I would be very surprised if it was within Elasticsearch, because if bits are randomly flipping like that then your system would surely not be very stable?