DocId bit flip after routing [Document retrievable by _search, but not returned by the get API]

The code I used for the routing test is below.

public void testSimpleMkpGet() {
    assertAcked(prepareCreate("test")
        .addMapping("type1", "field1", "type=keyword,store=true", "field2", "type=keyword,store=true")
        .setSettings(Settings.builder().put("index.refresh_interval", -1)
            .put("index.number_of_shards", 64))
        .addAlias(new Alias("alias")));
    ensureGreen();

    GetResponse response;

    logger.info("--> index doc 1");
    client().prepareIndex("test", "type1", "&36866960708-282941204767833454").setSource("field1", "value1", "field2", "value2").get();
 }

I was then debugging in the IDE to see what shardId is calculated at
https://github.com/elastic/elasticsearch/blob/v6.3.1/server/src/main/java/org/elasticsearch/cluster/routing/OperationRouting.java#L270
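For context, the shard is chosen from a hash of the routing value, which defaults to the document _id when the client supplies no explicit routing. The sketch below only illustrates that idea; it is not Elasticsearch's actual code, which hashes the routing value with Murmur3 (Murmur3HashFunction) and applies a routing factor. String.hashCode() here is just a stand-in for that hash.

// Hypothetical sketch of the routing idea, not Elasticsearch's real implementation.
// String.hashCode() stands in for the Murmur3 hash the real OperationRouting uses.
static int sketchShardId(String id, String routing, int numberOfShards) {
    String effectiveRouting = (routing != null) ? routing : id; // _routing defaults to _id
    int hash = effectiveRouting.hashCode();                     // stand-in for Murmur3
    return Math.floorMod(hash, numberOfShards);                 // 0 .. numberOfShards - 1
}

With 64 shards, a single flipped bit in the stored _id almost certainly hashes to a different shard, which is why the get API (which routes by _id) misses the document while _search, which fans out to every shard, still finds it.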

If bits are randomly flipping like that, then surely your system would not be very stable?

Only 10 docs out of 650M have this issue. All of them contain one special character or another, and they are not isolated to any particular shard or machine (even the replicas have it). The only coincidence I am seeing is that they were all indexed on the same day, within a window of about 6 hours.

Now I am convinced that it is a bit flip, since a get with routing set to the original value actually identifies the shard correctly and retrieves the doc for all 10 doc IDs. I am also pretty sure the client did not explicitly use routing. So this is some kind of bit flip after the document landed on the box. Given that all replicas have it, it seems fair to assume this happened on the primary itself before indexing rather than being a later flip, so the issue must be in Elasticsearch and not Lucene.
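For illustration, a get that supplies the original value as explicit routing looks roughly like this in the same test setup as above (flippedId and originalId are placeholder names, not the real IDs):

// Illustration only: fetch the doc by its stored (corrupted) _id, but route the
// request with the original value so it lands on the shard chosen at index time.
GetResponse fixed = client().prepareGet("test", "type1", flippedId)
    .setRouting(originalId)
    .get();
assert fixed.isExists();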

I need to think more about how to proceed with root-causing this; I'd appreciate any ideas.

Are you using TLS to protect communication between nodes?

No, I am not using TLS between nodes. Do you think that would avoid this? I guess it would catch the case of a bit flip happening over the network.

Given that the 10 docs are on different machines, I believe we can rule out an issue with a data node (multiple nodes having a hardware/kernel issue is unlikely).

What if one of the coordinating nodes had a problem? Say all 10 of these docs were coordinated through that node for indexing (I don't have a way to validate this either). After it generated the shardId (using the right id for routing) and before the request landed on the actual data node, a bit got changed. What worries me is why only the "_id" was affected each time, and why only 10 docs over 6 hours when the indexing rate is much higher (7.5K per minute).

I believe the coordinator writes sequentially to the primary and then to all replicas. Since all the replicas had this flip, isn't it unlikely to be a flip on the network, given that the request to each data node would be independent? It must be a bad coordinating node which flipped it before it started sending to any data node.

All of the ways I can think of to explain a random bit flip in a document ID are really rather unlikely, but I don't have a good alternative idea either. If we take this as the cause then you are right that it looks like it happened after the coordinating node computed the shard ID.

I think its effects must have been concentrated on document IDs. 650 million doc IDs is only about 160 billion bits, so 10 errors there is a rate of roughly 1 in 16 billion bits, which is very high. If 1 in 16 billion bits were flipping uniformly across all bits in memory on your machine then it'd be completely unusable.
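As a back-of-the-envelope check of that arithmetic (assuming IDs of roughly 31 bytes, like the example ID in the test above):

// Back-of-the-envelope check of the error rate; 31-byte IDs are an assumption
// based on the example ID above, not an exact figure from the cluster.
public class BitFlipRate {
    public static void main(String[] args) {
        long docs = 650_000_000L;
        long bitsPerId = 31L * 8;                                       // ~31 chars, 8 bits each
        long totalBits = docs * bitsPerId;                              // ~1.6e11 bits
        System.out.println("1 flip per " + (totalBits / 10) + " bits"); // ~16 billion
    }
}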

Document IDs are only really special within Elasticsearch: nothing else in the system distinguishes them from any other data. Yet Elasticsearch doesn't do any manipulation that might risk this kind of corruption, and additionally if it were a software issue then again I think that this kind of error rate would make things completely unusable.

The most significant other correlation I can think of is that these IDs will appear at the same offset in many packets on the network. I have heard of cases where failures of network hardware consistently affect a particular byte offset across multiple packets, for instance due to environmental factors such as heat or poorly-conditioned power pushing a flaky bit of RAM over the edge. It's still kinda surprising, because you'd have to be unlucky for this to affect only document IDs and not to cause messages to be ill-formed in other ways (which would drop the connection, which would be logged). TLS would solve this, because it gives an end-to-end integrity check on the network traffic, which means it can detect intermittent failures of intermediate devices.

Thanks for helping here, David. One thing I want to confirm is how replication works: does the data node which holds the primary forward the doc to the replicas, or does the coordinator send a new request to the replicas after the primary returns success? If it is the former, we can conclude it is the network.

Looks like there is a bigger impact than just the docId. I just checked the mapping of the index and I am seeing some unintended fields, like "accoujtId" in addition to the intended accountId, and "vahueBroad" and "va|ueBroad" in addition to the intended "valueBroad", all of which got a default mapping. This means bits flipped for these field names as well and the flipped names triggered dynamic mapping (which confirms the user did not send these values). Given that the documents get deleted over time (when the docid did not flip), we probably did not see the impact of this earlier.
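Each of those corrupted names is in fact exactly one bit away from the intended name ('n' 0x6E vs 'j' 0x6A, 'l' 0x6C vs 'h' 0x68, 'l' 0x6C vs '|' 0x7C), which is consistent with single-bit flips. A quick check:

// Quick check that each corrupted character differs from the intended one by exactly
// one bit: accountId/accoujtId, valueBroad/vahueBroad, valueBroad/va|ueBroad.
public class OneBitApart {
    public static void main(String[] args) {
        char[][] pairs = { {'n', 'j'}, {'l', 'h'}, {'l', '|'} };
        for (char[] p : pairs) {
            int xor = p[0] ^ p[1];
            System.out.printf("'%c' vs '%c': differing bits = %d%n",
                p[0], p[1], Integer.bitCount(xor));       // prints 1 for each pair
        }
    }
}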

Yes, the coordinator sends the indexing request(s) to the primary, which forwards them on to the replicas. The primary cannot return successfully until the document has been indexed on every in-sync replica. Hence my suspicion about the network between the coordinating node and the primary.

Thanks for the confirmation. It could also be bad memory on the coordinating node, right? That might have corrupted the docid before it was sent to the primary. Doesn't TCP also have a checksum? Ideally, even without TLS, that should have caught it.

In theory this is possible, but at the kinds of error rates you're looking at I would also have expected a corruption to hit something more important than these indexing requests and result in a SEGV or a kernel panic or something. Maybe you were just lucky.

TCP does indeed have a checksum, but it's a 16-bit ones'-complement sum, which is quite weak: it will catch single-bit errors but not necessarily larger ones. The checksum also doesn't necessarily protect the data while it's being processed by a router or other intermediate device, as that device may adjust the packet and recompute the checksum on egress.
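To illustrate how weak that is, here is a sketch of an RFC 1071-style 16-bit ones'-complement checksum: two offsetting byte changes in different 16-bit words leave it unchanged, so that corruption passes undetected (the payload bytes are arbitrary example data, not from this incident).

// Sketch of the RFC 1071 16-bit ones'-complement checksum that TCP uses, showing
// that two offsetting changes can cancel out. Example data only.
public class WeakChecksum {
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int word = (data[i] & 0xFF) << 8;
            if (i + 1 < data.length) word |= data[i + 1] & 0xFF;
            sum += word;
            while ((sum >>> 16) != 0) sum = (sum & 0xFFFF) + (sum >>> 16); // fold carries
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] original  = {0x01, 0x02, 0x03, 0x04};
        byte[] corrupted = {0x01, 0x03, 0x03, 0x03}; // +1 in one word, -1 in another
        System.out.printf("0x%04X vs 0x%04X%n", checksum(original), checksum(corrupted));
        // Both values are identical, so this two-byte corruption would slip through.
    }
}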

Thanks. Unfortunately I don't have logs from the time of the event and no way to prove whether this was a machine issue or the network. I did confirm that no hosts were replaced during that time, and also that there were no errors from Elasticsearch at all for any of the requests.

I am a bit concerned that Elasticsearch does not have a data integrity check and fails silently in either case (machine or network).

If it was a network issue, I agree TLS can protect against this. However, there seems to be a window where it can happen on the coordinating node before transport, and we don't seem to have any protection there. Shouldn't we build some integrity check here to ensure that what we wrote to the primary is the same as what the coordinating node received as a doc (irrespective of whether TLS was used or not)?
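For illustration, the kind of end-to-end check I have in mind could be approximated at the application level today: digest the source on the client, store the digest with the document, and re-verify it on read. This is a hypothetical sketch of the idea, not an Elasticsearch feature, and the sourceSha256 naming is made up.

// Hypothetical application-level integrity check, not an Elasticsearch feature.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SourceDigest {
    static String sha256Hex(String sourceJson) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(sourceJson.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String source = "{\"field1\":\"value1\",\"field2\":\"value2\"}";
        // Index this digest as an extra "sourceSha256" field, then recompute it from
        // the returned _source on read and compare the two values.
        System.out.println(sha256Hex(source));
    }
}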

Network-level integrity checks are one of the benefits of TLS. Machine-level integrity checks should be provided by the machine itself, using things like ECC memory and a kernel that terminates the process or halts the machine on detecting an uncorrectable error.

Given that the machines had ECC, there were no errors for any Elasticsearch requests (confirmed from the client-side logs around that time), and I don't see any machine crashing or being replaced at that time, we can probably rule out a machine issue. That leaves only the network, so we will enable TLS.

Thanks a lot, David, for sticking with me here and for all the wisdom. It was very helpful.

