I have noticed that the document indexing rate for a new data node stays higher than all other existing data nodes for few days after it's been introduced into the cluster. Is this expected?
Not wholly surprising, it's possible the new node gets a little more than its fair share of newer shards to start with. It's not reindexing anything tho.
If you look at the 2 graphs below. The new node has finished rebalancing around 11:30pm. You can see the packets received dived around there.
But the second graph clearly shows the indexing rate stayed high.
We thought about the possibility of simply shards moved to new node being busier.
But the indexing rate did not drop for existing nodes (at least not noticeable), which suggest the extra activity is purely the result of "more work" for the moved shards on the new node. And it often lingers for few days. What would this extra work be then? Any pointer will be highly appreciated.
Nobody has any insight?
I don't think those graphs really exclude the possibility that the new node just gets more new (active) indexing shards. At least they're not a very compelling argument. You'd need to look at the shard-level stats to understand which shards/indices are seeing all the indexing activity, and then look at how they're allocated across nodes.
Sure. But does this look like an unique situation to you though?
My experience with multiple (6) clusters in different env (aws, gcp) all exhibiting such behavior.
So there must be a common task/event/thread that's causing this behavior. It could be adding a new node will almost always causing heavily indexed shards to be moved to it, for example.
This is what I'm looking for (if it's a known behavior). Then I can figure out a way to work around it if possible, etc.
Are you suggesting there's no such "common" behavior when adding new nodes that you are aware of?
I haven't ever noticed anything like this, but docs-indexed-per-node is not a metric I pay much attention to so I could just have missed it.
Work around what exactly? Is it a problem, or are you just curious?
We had experienced an issue where cpu for a new data node stayed at 100% (flat line, not spiky) for several minutes during peak hour (several hours after rebalance was finished). Just the new data node. The rest data nodes CPU spiked as well since it's peak hour, but not above 80%.
We did experience missing data during that period of time but no error logs whatsoever. Not in ES nor our writers. The flat line 100% CPU utilization was the only anomaly. The missing data happens to be from a shard residing on that new data node.
This uneven CPU utilization is the potential workaround we are looking to address if we could not figure out what happened to the missing data. To ensure no data nodes get stuck at 100% CPU flat line.
There's a sentiment within our group that adding new data nodes in ES cluster might potentially cause missing data, which I don't believe to be the case. But the data do point to ES dropping data internally for some reason. Even though no error logs were recorded. But the fact that no errors during the 100% flat line period suggests something's not right either. I would've assume some sort of queue full/drop would've been reported. (just guessing here based on experience)
We have upgraded from ES 5 -> 6 -> 7.15. version 7.15.2 is what we are on now (about to upgrade to 8). This is the first time I've ever experienced this.
We recently just added the document indexing rate metric as well, so I can't tell if this has always been the case as a baseline.
I see, yes that sounds like a problem. But you can get load imbalances without imbalances in document count (if one index has much larger/more complex docs than another) and conversely an imbalance in doc count doesn't mean an imbalance in load. I'd suggest tracking CPU utilisation directly if that's the concern.
Recent 8.x versions can take factors like indexing CPU load into account when allocating shards but older versions consider all shards to be equal which can definitely lead to load imbalances.
I guess it depends what exactly you mean by "missing data". ES won't lose any docs that it has acknowledged as successfully written, but it might reject some docs and a faulty client that does not retry rejected docs until success would lose them. Also if the node is CPU-bound then it might take quite some time in between writing a doc and having that doc become visible to searches.
By "reported" do you mean "in the client" or "in the ES server logs"? I would not expect ES to log rejections due to a full queue or other backpressure, it expects the client to handle these things.
But also are you sure the queue was full and rejecting anything? 100% CPU does not imply that anything is being rejected.
True. 100% doesn't mean queue full. But that would be likely IMO.
We do synchronous writes, so there was a theory similar to what you stated. The cluster received the bulkwrite and everything looks good and returns good status. But when it's actually indexing the data, something went wrong and rejected payload. In our case, we hypothesize 100% CPU cause the drop cuz the payloads were fine outside of the 100% cpu flat line time.
We experienced logs regarding hitting queue size limit of 200 before; therefore, we know that queue full is being reported and logged. We didn't see that during 100% cpu becomes unusual to some of us...
This queue full log is being reported both in ES log and bulkWrite REST response.
That doesn't make sense. A good status in the response means that nothing was rejected or otherwise went wrong during indexing.
ok. So that theory would be invalid then.
Does synchronous write response means the data has been indexed successfully then? That doesn't seem to be what I've read. But I'm having trouble finding where I get that info from.
If indexing a bulkwrite takes 10 seconds, bulkwrite would return after 10 seconds?
Yes, writes into Elasticsearch are synchronous and only return when complete.