Unable to complete Synced Flush on idle index

Hello,

I'm using curator for index maintenance (time-based log indices), and it sometimes fails to close indices before proceeding with deletion.

The error message emitted is "ongoing operations on primary". The only related documentation I could find for this is on the Synced Flush page. When I request the affected shard's stats, I see it is missing a sync_id entirely. Note that this happens on indices that are old and receive no more data, so there are no "ongoing operations" from new data coming in. Also, once this starts happening for a shard, it doesn't stop.

What's extra weird to me is that if I POST a _close for the index, it completes successfully, and if I then rerun curator it will happily finish its job.
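
For reference, this is roughly how I check which shard copies are missing a sync_id. It is only a minimal Python sketch using the elasticsearch-py client; the host, the index name, and the level=shards parameter are my own examples, so adjust as needed:

import elasticsearch

es = elasticsearch.Elasticsearch(["localhost:9200"])  # example host

index = "wsjiis-2018.01.21"  # example index
stats = es.indices.stats(index=index, level="shards")

for shard_id, copies in stats["indices"][index]["shards"].items():
    for copy in copies:
        # The sync marker, when present, sits in the commit user data.
        sync_id = copy.get("commit", {}).get("user_data", {}).get("sync_id")
        print("shard %s: %s" % (shard_id, sync_id or "MISSING sync_id"))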

Is there anything more in the logs? Without debug logging enabled (and possibly with blacklisting disabled), I won't be able to tell you what is happening.

What are the timestamps in the logs? In other words, what is the time difference between when the _close request is made and the timestamp on the error message? Is it immediate, or is there a delay? If so, how big is the delay?

Here are the two relevant curator log lines:

2018-02-05 07:31:55,944 INFO      Closing selected indices: ['wsjiis-2018.01.28', 'wsjiis-2018.01.26', 'wsjiis-2018.01.22', 'wsjiis-2018.01.25', 'wsjiis-2018.01.23', 'wsjiis-2018.01.21', 'wsjiis-2018.01.27', 'wsjiis-2018.01.24']
2018-02-05 07:31:56,013 ERROR     Failed to complete action: close.  : Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(409, '{"_shards":{"total":16,"successful":15,"failed":1},"wsjiis-2018.01.21":{"total":2,"successful":1,"failed":1,"failures":[{"shard":1,"reason":"[1] ongoing operations on primary"}]},"wsjiis-2018.01.24":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.25":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.22":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.23":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.28":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.26":{"total":2,"successful":2,"failed":0},"wsjiis-2018.01.27":{"total":2,"successful":2,"failed":0}}')

The timing between them seems practically instant. I changed to DEBUG level when I was testing, but it showed nothing more interesting: there were a lot of extra log lines about index selection, but nothing between the closing log line and the failure. That could be due to the blacklist, but I can only test that when/if this happens again.

Looking at curator's code here, it seems to me that the error from elasticsearch comes from the synced flush that curator performs on the index before closing it. That would explain why my manual _close works: I did not do a manual _flush first, which would probably have failed with the same error.

So the question for curator is: should it try to flush the index before closing it? And the question for elasticsearch is: why can't I flush my index?
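
If it helps with reproducing, the same failure can be triggered outside curator by calling the synced flush API directly on the suspect index. A small Python sketch with the elasticsearch-py client (host and index name are examples):

import elasticsearch

es = elasticsearch.Elasticsearch(["localhost:9200"])  # example host

try:
    # The synced flush curator attempts before closing: POST <index>/_flush/synced
    resp = es.indices.flush_synced(index="wsjiis-2018.01.21")
    print("synced flush succeeded:", resp["_shards"])
except elasticsearch.ConflictError as e:
    # Elasticsearch answers 409 when any shard copy fails; the body carries the
    # per-shard reasons, e.g. "[1] ongoing operations on primary".
    print("synced flush failed:", e.info)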

Yes, you really do want a synced flush before closing. Without it, the entire replica has to be rebuilt after re-opening the closed index. Because that is so undesirable, Curator attempts a synced flush before closing. The mystery here is why the index isn't already synced.

Elasticsearch, behind the scenes, automatically attempts a synced flush on indices that have been idle for a mere 5 minutes. In other words, if your index really is idle, it should already be synced between the primary and all replicas. If it is not, that might explain why the synced flush fails when Curator attempts it before closing.

I should probably add code to continue with the close even if the 409 comes through, and just log the error. But this is not common at all. Most users never attempt to close active indices, and as mentioned, Elasticsearch automatically attempts the synced flush for them after 5 minutes of idle time. That this is happening to you (idle indices that cannot be synced-flushed) is definitely a mystery worth understanding. If you would be so kind as to change the title of this topic to "Elasticsearch unable to do synced_flush on idle indices", we can get eyes on this from more people on the Elasticsearch team. Yes, Curator is experiencing the result, but Curator isn't the problem here. It just calls the API, and something under the hood is failing to connect the dots.
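
Something along these lines is what I have in mind. To be clear, this is not Curator's actual code, just a sketch of the idea using the elasticsearch-py client (host and index name are examples):

import logging
import elasticsearch

logger = logging.getLogger("curator")

def flush_and_close(es, index):
    """Attempt a synced flush, log a 409 instead of aborting, then close."""
    try:
        es.indices.flush_synced(index=index)
    except elasticsearch.ConflictError as e:
        # Report the per-shard failures but still proceed with the close.
        logger.warning("Synced flush failed for %s: %s", index, e.info)
    es.indices.close(index=index)

# Example usage:
es = elasticsearch.Elasticsearch(["localhost:9200"])
flush_and_close(es, "wsjiis-2018.01.21")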

Nice, thanks for helping get to the bottom of this; I have changed the title as you suggested. I restored one of the affected indices from backup and saw the same thing again: shard 1 was still missing a sync_id, which indeed suggests it never flushed on its own:

{
  "indices": {
    "wsjiis-2018.01.21": {
      "shards": {
        "0": [
          {
            "commit": {
              "id": "XemNxYf37qUO8/tesSgBqA==",
              "generation": 15,
              "user_data": {
                "local_checkpoint": "14136041",
                "max_unsafe_auto_id_timestamp": "1516492801248",
                "translog_uuid": "Hj9op1G_QOW7WBrZMq65Nw",
                "history_uuid": "GpWqTk4BQXuHe7KtyTWLUw",
                "sync_id": "SNseJ0k-TUSgwATltPcHiA",
                "translog_generation": "1",
                "max_seq_no": "14136041"
              },
              "num_docs": 14136042
            }
          }
        ],
        "1": [
          {
            "commit": {
              "id": "BGwoqdC1ik2i1k65ZtpPuA==",
              "generation": 14,
              "user_data": {
                "local_checkpoint": "14140162",
                "max_unsafe_auto_id_timestamp": "1516492801248",
                "translog_uuid": "jCNYUoUTSOeMeJNxcIs48Q",
                "history_uuid": "kQIVkQzZR4615NPspXgGQA",
                "translog_generation": "1",
                "max_seq_no": "14140162"
              },
              "num_docs": 14140163
            }
          }
        ]
      }
    }
  }
}

I then tried a manual /_flush/synced and it completed successfully. Closing the index, backing it up, and restoring it probably cleared whatever was making Elasticsearch think there were ongoing operations. Here's the _stats afterwards:

{
  "indices": {
    "wsjiis-2018.01.21": {
      "shards": {
        "0": [
          {
            "commit": {
              "id": "XemNxYf37qUO8/tesSgf9w==",
              "generation": 16,
              "user_data": {
                "local_checkpoint": "14136041",
                "max_unsafe_auto_id_timestamp": "1516492801248",
                "translog_uuid": "Hj9op1G_QOW7WBrZMq65Nw",
                "history_uuid": "GpWqTk4BQXuHe7KtyTWLUw",
                "sync_id": "HobgXxZBSyqfr7eKr9wVrg",
                "translog_generation": "1",
                "max_seq_no": "14136041"
              },
              "num_docs": 14136042
            }
          }
        ],
        "1": [
          {
            "commit": {
              "id": "BGwoqdC1ik2i1k65Ztpt9Q==",
              "generation": 15,
              "user_data": {
                "local_checkpoint": "14140162",
                "max_unsafe_auto_id_timestamp": "1516492801248",
                "translog_uuid": "jCNYUoUTSOeMeJNxcIs48Q",
                "history_uuid": "kQIVkQzZR4615NPspXgGQA",
                "sync_id": "MYtp2toUSlKZijBHaqfczg",
                "translog_generation": "1",
                "max_seq_no": "14140162"
              },
              "num_docs": 14140163
            }
          }
        ]
      }
    }
  }
}

I still have one index backed up that should be in the same state (no sync_id), in case anyone has suggestions for different tests to run on it. Otherwise we will have to wait for the issue to appear again on a new index that also fails the synced flush.

IIRC the issue has appeared three times in total and only after upgrading to 6.0.

That's a key detail. How many segments does this index have? Can it be forceMerged? That would force the segments to change, and it might help.

Another possibility is that the shards contain segments from an older version of Lucene, which could have something to do with it. If the forceMerge doesn't help, you could do a reindex, which would rewrite your data with the version of Lucene that Elasticsearch is currently using.
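
Both of those can be done through the API. A rough Python sketch with the elasticsearch-py client (the host, index, and destination names are just examples):

import elasticsearch

es = elasticsearch.Elasticsearch(["localhost:9200"])  # example host

# Force-merge the old index down to a single segment per shard, which rewrites
# the segments. This can take minutes, hence the long request timeout.
es.indices.forcemerge(
    index="wsjiis-2018.01.21",
    max_num_segments=1,
    request_timeout=3600,
)

# If that doesn't help, reindex into a fresh index so the data is rewritten
# with the Lucene version the current Elasticsearch is using.
es.reindex(
    body={
        "source": {"index": "wsjiis-2018.01.21"},
        "dest": {"index": "wsjiis-2018.01.21-reindexed"},
    },
    wait_for_completion=True,
    request_timeout=3600,
)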

To be more specific, the cluster is at version 6.1.1. The upgrade from 5.x was not done on the same machines but as a fresh install on new ones. The indices that have had the issue were all created on the 6.1 cluster, not migrated from a previous version.

Every day we use curator to forcemerge yesterday's indices down to one segment per shard. On the 22nd, the index in question (from the 21st) appears to have been forcemerged successfully in just under six minutes:

2018-01-22 06:40:32,464 INFO      Trying Action ID: 2, "forcemerge": Forcemerge indices older than today
2018-01-22 06:40:32,896 INFO      forceMerging selected indices
2018-01-22 06:40:32,896 INFO      forceMerging index wsjiis-2018.01.21 to 1 segments per shard.  Please wait...
2018-01-22 06:46:10,170 INFO      Pausing for 120.0 seconds before continuing...
2018-01-22 06:48:10,257 INFO      Action ID: 2, "forcemerge" completed.

I can confirm the forcemerge was successful on the restored copy too:

{
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "indices" : {
    "wsjiis-2018.01.21" : {
      "shards" : {
        "0" : [
          {
            "routing" : {
              "state" : "STARTED",
              "primary" : true,
              "node" : "o8B-Y633SXKqLOA6hAKjaA"
            },
            "num_committed_segments" : 1,
            "num_search_segments" : 1,
            "segments" : {
              "_1v64" : {
                "generation" : 87052,
                "num_docs" : 14136042,
                "deleted_docs" : 0,
                "size_in_bytes" : 4795105794,
                "memory_in_bytes" : 9468622,
                "committed" : true,
                "search" : true,
                "version" : "7.1.0",
                "compound" : false,
                "attributes" : {
                  "Lucene50StoredFieldsFormat.mode" : "BEST_COMPRESSION"
                }
              }
            }
          }
        ],
        "1" : [
          {
            "routing" : {
              "state" : "STARTED",
              "primary" : true,
              "node" : "IjdD-k29Tbi7wqqUgRURXA"
            },
            "num_committed_segments" : 1,
            "num_search_segments" : 1,
            "segments" : {
              "_1vnd" : {
                "generation" : 87673,
                "num_docs" : 14140163,
                "deleted_docs" : 0,
                "size_in_bytes" : 4796632669,
                "memory_in_bytes" : 9474272,
                "committed" : true,
                "search" : true,
                "version" : "7.1.0",
                "compound" : false,
                "attributes" : {
                  "Lucene50StoredFieldsFormat.mode" : "BEST_COMPRESSION"
                }
              }
            }
          }
        ]
      }
    }
  }
}

BTW, it looks like the Lucene version here is 7.1.0, I guess?
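
In case it's useful for the older-Lucene theory, this is how I'd list the Lucene version of every segment in the index (a Python sketch; host and index name are examples):

import elasticsearch

es = elasticsearch.Elasticsearch(["localhost:9200"])  # example host

index = "wsjiis-2018.01.21"
segs = es.indices.segments(index=index)

versions = set()
for copies in segs["indices"][index]["shards"].values():
    for copy in copies:
        for name, segment in copy["segments"].items():
            versions.add(segment["version"])
            print(name, segment["version"], segment["num_docs"])

print("distinct Lucene versions:", sorted(versions))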

For a bit more context in case it's relevant, our daily backup procedure consists of the following (a rough sketch of the snapshot step follows after the list):

  1. Optimize yesterday's indices with curator:
    • Remove replication
    • Forcemerge to 1 segment per shard
  2. Backup yesterday's indices with a shell script (inspiration for this):
    • Create a new NFS shared filesystem snapshot repository just for the one index
    • Create a snapshot of the index there
    • Delete the snapshot repository
    • Zip the files of the snapshot repository
  3. Delete indices more than a week old with curator:
    • Close
    • Delete
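
Roughly, the snapshot part boils down to the following. This is a Python sketch of what the shell script does, not the script itself; the repository name, NFS path, and host are made up, and the NFS location has to be whitelisted in path.repo on the nodes:

import elasticsearch

es = elasticsearch.Elasticsearch(["localhost:9200"])  # example host

index = "wsjiis-2018.01.21"
repo = "backup-" + index  # throwaway repository used for this one index

# Register a shared-filesystem repository on the NFS mount.
es.snapshot.create_repository(
    repository=repo,
    body={"type": "fs", "settings": {"location": "/mnt/nfs/backups/" + index}},
)

# Snapshot just this index and wait for it to finish.
es.snapshot.create(
    repository=repo,
    snapshot=index,
    body={"indices": index, "include_global_state": False},
    wait_for_completion=True,
)

# Drop the repository reference; the files stay on disk and are zipped afterwards.
es.snapshot.delete_repository(repository=repo)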

I recently ran into something that sounds similar, and upgrading to 6.1.3 fixed it.

That looks interesting, thanks for the insight, Loren. I probably won't be upgrading very soon (unless the issue starts appearing aggressively), so I'll post again if I have anything useful.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.