DocId bit flip after routing [Document retrievable by _search, but doesn't return in get api]

muralikpbhat · July 26, 2019, 11:35am

Hi,
I have a document indexed a month back in Elasticsearch version 6.3.1. If I try searching by id or any other term in that doc, it returns that doc. However get api responds with found:false indicating that the doc is not present. Why would it happen and how can I debug this?

GET /_search
{ "query" : {
"match" : {
"_id": "&36866960708-282941204767833454"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 64,
"successful" : 11,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
...
"_id" : "&36866960708-282941204767833454",
...

GET _doc/&36866960708-282941204767833454?pretty"
{
...
"_id" : "&36866960708-282941204767833454",
"found" : false
}

DavidTurner · July 26, 2019, 12:17pm

I cannot reproduce this, because this GET command does not respond as you say at all:

GET /_doc/&36866960708-282941204767833454?pretty

# 405 Method Not Allowed
# {
#   "status": 405,
#   "error": "Incorrect HTTP method for uri [/_doc/&36866960708-282941204767833454?pretty] and method [GET], allowed: [POST]"
# }

Mark_Harwood · July 26, 2019, 12:21pm

Does the ampersand need escaping to make a valid URL?

DavidTurner · July 26, 2019, 12:25pm

It technically does, although that's not the problem. The issue is that _doc isn't a valid index name.

muralikpbhat · July 26, 2019, 1:13pm

Sorry, I redacted the url wrongly. _doc is doctype. full url is something like below

curl -H 'Content-Type:application/json' "localhost:9200/$INDEX/_doc/&36866960708-282941204767833454"

I had tried with url escaping as well, it is the same result.

DavidTurner · July 26, 2019, 1:22pm

Ok, my best guess is that you're trying to get the document from a different index from the one it's in (given that you aren't sharing the index names). Another possibility is that you've deleted this document but not refreshed this index yet.

muralikpbhat · July 26, 2019, 1:35pm

No, index name is same. And it is indexed 30+ days back, ruling out refresh concern.

One more information is that the developer who indexed that doc claims that particular char that he sent in the docid was '6' instead of '&' Indicating a single bit flip some where in the index. There are around 10 docs where one bit seems to have flipped like this and all of them have the same issue (searchable, but not retrievable via get api). Have we seen similar issues?

By the way not able to reproduce this if I try with a new index with that & char in docid.

curl -XPOST -H 'Content-Type:application/json' "localhost:9200/index/_doc/&36866960708-282941204767833454" -d '{"first": "last"}'

{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

curl localhost:9200/index/_search
{"took":125,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_score":1.0,"_source":{"first": "last"}}]}}

curl -H 'Content-Type:application/json' "localhost:9200/index/_doc/&36866960708-282941204767833454"
{"_index":"index","_type":"_doc","_id":"&36866960708-282941204767833454","_version":1,"found":true,"_source":{"first": "last"}}

DavidTurner · July 26, 2019, 1:41pm

Ah ok. A bit flip is very surprising. However, if we consider that as a possibility then that document might well be in the wrong shard for its ID, so GET won't be able to find it. If you use the ?routing=XXXX parameter for sufficiently many values of XXXX then do you find it?

muralikpbhat · July 26, 2019, 2:01pm

I know the particular shard which has this document. I got there by doing search with preference parameter with shard numbers. If the doc has gone to that shard for indexing, it is unlikely that the doc will be present in another shard now right?

how does the get api work? Since search is able to fetch the _source from StoredFields, I believe 'get' also should be able to reach there. How does 'get' differ from 'fetch' that happens in search?

DavidTurner · July 26, 2019, 2:23pm

_search searches all the shards in the index, whereas a plain GET just queries the one shard according to the given document ID (or the routing parameter if given).

muralikpbhat · July 26, 2019, 2:53pm

when I do _search?preference=_only_local, it only searches the local shard. I first arrived at the shard id with preference=shard:x and then went to that host which had shard x and did _only_local to confirm.

muralikpbhat · July 26, 2019, 3:04pm

Now, I wrote a small Lucene code to do the same query directly on that shard's index directory and interestingly I don't get that doc. Some what validates your theory that Elasticsearch has not refreshed.

I can do a force refresh, but worried that if it solves it, I lost the debugability and may not be able to reproduce the issue. There are 2 cases that I am after. 1) why the bit flipped (will look for some proof with the developer who fed it to ensure the issue is not in the client) 2) Why did refresh not happen for so long. We have frequent 4hourly delete_by_query that deleted all the docs older than 30days and all docs have been getting deleted indicating refresh is actually happening.

muralikpbhat · July 29, 2019, 7:52am

Hmm, I had a bug in this test code. I actually see this result from lucene itself in a new index reader, ruling out any issues with refresh

Below code returns the doc.

    QueryParser qp = new QueryParser("accountId", new StandardAnalyzer());
    Query query1 = qp.parse("636866960708");
    QueryParser qp2 = new QueryParser("metricId", new StandardAnalyzer());
    Query query2 = qp2.parse("282941204767833454");
    BooleanQuery booleanQuery = new BooleanQuery.Builder()
        .add(query1, BooleanClause.Occur.MUST)
        .add(query2, BooleanClause.Occur.MUST)
        .build();
    TopDocs hits = searcher.search(booleanQuery, 10);
    System.out.println(hits.totalHits);

muralikpbhat · July 29, 2019, 1:17pm

Hmm, this is getting interesting. I checked the code path on 'Get' and even VersionsAndSeqNoResolver is getting that particular docid version correctly. How come Elasticsearch 'get' is returning 'false'?

    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexReader reader = DirectoryReader.open(dir);
    System.out.println(VersionsAndSeqNoResolver.loadDocIdAndVersion(reader,
        new Term(IdFieldMapper.NAME, Uid.encodeId("&36866960708-282941204767833454"))).version);

I don't see the code doing anything other than version conflict check after getting the version.

github.com

elastic/elasticsearch/blob/399d53e5c03ae4157d9ba107d01fbecfed758972/server/src/main/java/org/elasticsearch/index/engine/Engine.java#L606


    SUCCESS,
    COMMIT_MISMATCH,
    PENDING_OPERATIONS
}


protected final GetResult getFromSearcher(Get get, BiFunction<String, SearcherScope, Engine.Searcher> searcherFactory,
                                            SearcherScope scope) throws EngineException {
    final Engine.Searcher searcher = searcherFactory.apply("get", scope);
    final DocIdAndVersion docIdAndVersion;
    try {
        docIdAndVersion = VersionsAndSeqNoResolver.loadDocIdAndVersion(searcher.getIndexReader(), get.uid(), true);
    } catch (Exception e) {
        Releasables.closeWhileHandlingException(searcher);
        //TODO: A better exception goes here
        throw new EngineException(shardId, "Couldn't resolve version", e);
    }


    if (docIdAndVersion != null) {
        if (get.versionType().isVersionConflictForReads(docIdAndVersion.version, get.version())) {
            Releasables.close(searcher);
            throw new VersionConflictEngineException(shardId, get.id(),

DavidTurner · July 29, 2019, 1:27pm

I think it's because it's not running on the shard that you're currently looking at. In 6.3.1 the shard or shards are calculated here:

github.com

elastic/elasticsearch/blob/v6.3.1/server/src/main/java/org/elasticsearch/action/get/TransportGetAction.java#L63-L66


      
          protected ShardIterator shards(ClusterState state, InternalRequest request) {
              return clusterService.operationRouting()
                      .getShards(clusterService.state(), request.concreteIndex(), request.request().id(), request.request().routing(), request.request().preference());
          }

muralikpbhat · July 29, 2019, 1:36pm

Thanks David, I will explore that direction, but was not very convinced earlier since bit flip looked very unlikely :). Essentially what we are saying is docid "&36866960708-282941204767833454" went to shard X initially, but now the routing is taking it to some other shard. For that to happen, docid had to change after the initial routing!

Let me confirm what shard it is sending now and get back!

muralikpbhat · July 29, 2019, 4:37pm

Index has 64 shards. _search with preference=_shards:34 returns both the docs (none of the other shards have these)
"_id" : "&36866960708-282941204767833454",
"_id" : "636866960708-282941204767833454",

However, OperationRouting class shows the following mapping for index with 64shards.
"_id" : "&36866960708-282941204767833454" -> shardId:30
"_id" : "636866960708-282941204767833454" -> shardId:62

DavidTurner · July 29, 2019, 4:58pm

That would explain why the Get API doesn't find it indeed.

The fact that neither of these docs are in the shard that they should be according to their IDs is informative. When indexing this document, was the routing parameter set?

muralikpbhat · July 29, 2019, 5:35pm

AFAIK, routing was not used. But, no easy way for me to confirm this and I am following up with the developer who has ingested that doc.

Also tried the _search_shards api, wondering why we are seeing different shard number in the cluster compared to the code(I am having the 6.3.1 code). Am I doing something wrong?

/_search_shards?routing=%2636866960708-282941204767833454&pretty" | grep shard

"shards" : [
"shard" : 32,
"shard" : 32,
"shard" : 32,

/_search_shards?routing=636866960708-282941204767833454&pretty" | grep shard
"shards" : [
"shard" : 34,
"shard" : 34,
"shard" : 34,

DavidTurner · July 29, 2019, 5:46pm

Those API calls look reasonable. Without looking at the code you're using to simulate this it's hard to say where the problem is.

Given that GET /_search_shards?routing=636866960708-282941204767833454 points us at shard 34, which is the shard in which these docs live, I wonder if these docs were indexed with routing=636866960708-282941204767833454 and an incorrect ID, or whether the strange error replacing the 6 with the & was within Elasticsearch after it had computed the routing. I would be very surprised if it was within Elasticsearch, because if bits are randomly flipping like that then your system would surely not be very stable?

Topic		Replies	Views
Routing bug: no routing document get doc id not found, while set specific routing param found the docs Elasticsearch	8	770	September 25, 2020
GET API doesn't return for some documents even if they exist Elasticsearch	3	294	July 6, 2017
Curl by id does not work although the doc exists Elasticsearch	3	626	July 6, 2017
Doc can be found by query but not by get Elasticsearch	8	1080	September 7, 2018
Inconsistent document existence between search and get? Elasticsearch	3	335	July 6, 2017

DocId bit flip after routing [Document retrievable by _search, but doesn't return in get api]

Related topics