I'm hoping the following behavior can be explained so that we can continue the rolling restart procedure with confidence. We are running a v7.4 cluster.
We have stopped all indexing into the cluster before these steps are taken.
Step 2 of the procedure says to issue a _flush/synced request until no failures are seen:
"When you perform a synced flush, check the response to make sure there are no failures. Synced flush operations that fail due to pending indexing operations are listed in the response body, although the request itself still returns a 200 OK status. If there are failures, reissue the request."
We do this and reach a point where there are no failures, but when we reissue a synced flush 10s later we start seeing failures again. Is this OK and expected? Can we continue the rolling restart process with confidence?
"_flush/synced?pretty"
Thu 3 Jun 11:30:25 UTC 2021
{
  "_shards" : {
    "total" : 16708,
    "successful" : 16678,
    "failed" : 30
  },
..
..
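For reference, the "reissue until there are no failures" step quoted above could be scripted roughly as follows. This is only a sketch: `synced_flush` is a hypothetical callable standing in for whatever performs the POST to `_flush/synced` and returns the parsed JSON body, and the retry count and 10s delay are illustrative defaults, not values from the documentation.

```python
import time
from typing import Any, Callable, Dict


def flush_until_clean(
    synced_flush: Callable[[], Dict[str, Any]],
    max_attempts: int = 5,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Reissue a synced flush until no shard fails; return the attempts used.

    `synced_flush` is a hypothetical callable that performs the
    POST /_flush/synced request and returns the parsed JSON body.
    """
    for attempt in range(1, max_attempts + 1):
        body = synced_flush()
        # The request returns 200 OK even when shards fail,
        # so the failure count has to be read from the body.
        failed = body["_shards"]["failed"]
        if failed == 0:
            return attempt
        print(f"attempt {attempt}: {failed} shard(s) failed, retrying")
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(
        f"synced flush still reporting failures after {max_attempts} attempts"
    )
```

A clean result here only says that one particular request saw no failures, which is exactly the situation the question is about.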
Then, after reaching a point with no failed shards, we issue the command again (this time with more detail):
_flush/synced?pretty" | grep -i -B 1 -A 15 failures
Thu 3 Jun 12:09:52 UTC 2021
    "failed" : 1,
    "failures" : [
      {
        "shard" : 2,
        "reason" : "pending operations",
        "routing" : {
          "state" : "STARTED",
          "primary" : true,
          "node" : "4SO4lMH-TMik7l1rNBhBdA",
          "relocating_node" : null,
          "shard" : 2,
          "index" : "7_0df0304d_01b6_4cba_825f_58f36bbfdb2f-000001",
          "allocation_id" : {
            "id" : "gCT0LSolRai7XguuBMLvZw"
          }
        }
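Rather than grepping, the failing shard copies could be pulled out of the full response with something like the sketch below. It assumes the response has the shape shown above: a top-level `_shards` summary plus one entry per index, each with an optional `failures` list. The function name `list_failures` is made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def list_failures(body: Dict[str, Any]) -> List[Tuple[str, int, str, str]]:
    """Return (index, shard, reason, node) for each failure in the response."""
    rows = []
    for index_name, stats in body.items():
        if index_name == "_shards":
            continue  # cluster-wide summary, not a per-index entry
        for failure in stats.get("failures", []):
            # "routing" may be absent, so fall back to an empty node id.
            routing = failure.get("routing", {})
            rows.append(
                (index_name, failure["shard"], failure["reason"], routing.get("node", ""))
            )
    return rows
```

Running this after each synced flush and comparing the results would show whether the same indices and nodes keep reporting "pending operations" or whether the failures move around.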
Is there any way to know what these pending operations are? Should we care once a single run has come back with no failures?
Obviously, something else is going on behind the scenes.
Are there any other debug options that would help shed light on what is going on?
Looking into the Elasticsearch code a little more, I see snippets like the following:
if (indexWriter.hasUncommittedChanges()) {
    logger.trace("can't sync commit [{}]. have pending changes", syncId);
    return SyncedFlushResult.PENDING_OPERATIONS;
}
https://lucene.apache.org/core/7_7_3/core/org/apache/lucene/index/IndexWriter.html#hasUncommittedChanges--
"Returns true if there may be changes that have not been committed. There are cases where this may return true when there are no actual "real" changes to the index, for example if you've deleted by Term or Query but that Term or Query does not match any documents. Also, if a merge kicked off as a result of flushing a new segment during commit(), or a concurrent merged finished, this method may return true right after you had just called commit()."