ES 7.5 translog recovery is extremely slow

A sample of the response from the recovery API:

"translog" : {
"recovered" : 17201,
"total" : 4686825,
"percent" : "0.4%",
"total_on_start" : -1,
"total_time_in_millis" : 1790702
}

This is on an index that is no longer being written to. The size is well below the cap, but getting 0.4% of the way through has taken 30 minutes. At that rate, letting it complete would take far too long (although, admittedly, these recoveries seem to eventually go away on their own long before they would if that pace were maintained).

The shards for this index are all around 6GB, and when I get shard stats for the index I see:

"translog" : {
"operations" : 0,
"size_in_bytes" : 55,
"uncommitted_operations" : 0,
"uncommitted_size_in_bytes" : 55,
"earliest_last_modified_age" : 4316474
}

Nothing about this index is special (ever since moving to 7.5, I have seen this sort of translog slowness on many different indices), and I have not changed any translog settings away from their defaults.

To get rid of these parasitic translog operations, I have been reducing the replica count on the index to 0 (I thought reducing from 2 to 1 would work, but it always seems to keep the replica that is being recovered from the translog), allowing the recovery operations to be abandoned, and then increasing the replica count back to its original value. At that point, the translog size is reported as 0, and recovery happens very quickly.
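
Concretely, the sequence I run looks roughly like this (console syntax; index1 is a stand-in for the real index name, which normally has two replicas):

# drop all replicas, which abandons the stuck recovery
PUT /index1/_settings
{
  "index" : { "number_of_replicas" : 0 }
}

# then restore the original replica count; the fresh copies recover quickly
PUT /index1/_settings
{
  "index" : { "number_of_replicas" : 2 }
}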

From my perspective, we would be better off if the translog did not exist and we instead just recovered by copying the raw data. Translog recovery has always been slow for us, but never this bad. Is there something about the new "soft deletes" (our old cluster was on ES 6.4) that I am not understanding that could be causing this? Assuming that might be to blame, I am tempted to either disable them or dramatically reduce the lease period.

Could you share the index settings (GET /index/_settings), its detailed stats (GET /index/_stats?level=shards) and the complete recovery output too? Use https://gist.github.com/ since that won't fit here.
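
For reference, the requests would be something like this (index1 stands in for your index name):

GET /index1/_settings
GET /index1/_stats?level=shards
# add ?active_only=true to show only in-flight recoveries
GET /index1/_recovery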

The recovery info filtered to active only seemed small enough, so I included it here. I removed node and index names, but the only thing interesting there is that the target node for both of these recoveries was the same.

{
  "index2" : {
    "shards" : [
      {
        "id" : 23,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "start_time_in_millis" : 1579283162449,
        "total_time_in_millis" : 1790028,
        "source" : {
          "id" : "",
          "host" : "",
          "transport_address" : "",
          "ip" : "",
          "name" : ""
        },
        "target" : {
          "id" : "",
          "host" : "",
          "transport_address" : "",
          "ip" : "",
          "name" : ""
        },
        "index" : {
          "size" : {
            "total_in_bytes" : 0,
            "reused_in_bytes" : 0,
            "recovered_in_bytes" : 0,
            "percent" : "0.0%"
          },
          "files" : {
            "total" : 0,
            "reused" : 0,
            "recovered" : 0,
            "percent" : "0.0%"
          },
          "total_time_in_millis" : 35,
          "source_throttle_time_in_millis" : 0,
          "target_throttle_time_in_millis" : 0
        },
        "translog" : {
          "recovered" : 2536,
          "total" : 47362814,
          "percent" : "0.0%",
          "total_on_start" : -1,
          "total_time_in_millis" : 1789982
        },
        "verify_index" : {
          "check_index_time_in_millis" : 0,
          "total_time_in_millis" : 0
        }
      }
    ]
  },
  "index1" : {
    "shards" : [
      {
        "id" : 3,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "start_time_in_millis" : 1579283161715,
        "total_time_in_millis" : 1790771,
        "source" : {
          "id" : "-UBHqkQ",
          "host" : "",
          "transport_address" : "",
          "ip" : "",
          "name" : ""
        },
        "target" : {
          "id" : "",
          "host" : "",
          "transport_address" : "",
          "ip" : "",
          "name" : ""
        },
        "index" : {
          "size" : {
            "total_in_bytes" : 0,
            "reused_in_bytes" : 0,
            "recovered_in_bytes" : 0,
            "percent" : "0.0%"
          },
          "files" : {
            "total" : 0,
            "reused" : 0,
            "recovered" : 0,
            "percent" : "0.0%"
          },
          "total_time_in_millis" : 31,
          "source_throttle_time_in_millis" : 0,
          "target_throttle_time_in_millis" : 0
        },
        "translog" : {
          "recovered" : 17201,
          "total" : 4686825,
          "percent" : "0.4%",
          "total_on_start" : -1,
          "total_time_in_millis" : 1790702
        },
        "verify_index" : {
          "check_index_time_in_millis" : 0,
          "total_time_in_millis" : 0
        }
      }
    ]
  }
}

And the index settings:

{
  "index1" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "total_shards_per_node" : "1"
          }
        },
        "refresh_interval" : "1m",
        "number_of_shards" : "30",
        "translog" : {
          "durability" : "async"
        },
        "provided_name" : "index1",
        "creation_date" : "1579280534054",
        "unassigned" : {
          "node_left" : {
            "delayed_timeout" : "10m"
          }
        },
        "number_of_replicas" : "2",
        "uuid" : "EsBjp6Y8Q8ebQW0ms4ZQYA",
        "version" : {
          "created" : "7050099"
        }
      }
    }
  }
}

These indices are no longer recovering (one completed about an hour in, the other I cancelled by reducing replicas to 0), so I can't get you the full shard stats. I can tell you, though, that when I looked at them during the issue, there were no big numbers in any of the translog sections. The "operations" value for every shard was 0, and the sizes were all very low (they certainly did not add up to 4 MB). Once it happens again, however, I could include that in a gist, if it would be useful.

Thanks

Ah, yes I think we want this info on a shard that's currently recovering slowly. By all means redact the host names and IP addresses but please leave the random node IDs intact.

Would you also be able to set logger.org.elasticsearch.indices.recovery to TRACE via the cluster settings API? This might result in quite a lot of output so be ready to set it back to null if it seems to be stressing your cluster.
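
Something like this, for instance; setting the value back to null afterwards restores the default:

PUT /_cluster/settings
{
  "transient" : {
    "logger.org.elasticsearch.indices.recovery" : "TRACE"
  }
}

# when you're done, or if it's too noisy
PUT /_cluster/settings
{
  "transient" : {
    "logger.org.elasticsearch.indices.recovery" : null
  }
}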

"total" : 4686825

Also, how long was the replica offline (while the indexing continued)?

Also, how long was the replica offline (while the indexing continued)?

I'm not sure I understand the question. This recovery operation was to bring the index back to a green state, so this replica was offline for the duration of the recovery. At the time I collected that data, that was about 30 minutes. I've recently seen them approach an hour. After that, I am unsure if they tend to speed up at some point, or if I have just not noticed them.

The last time we set nodes to TRACE, the output to disk actually managed to overwhelm our hardware, which caused problems for the node and prompted it to output even more to disk. I could look at doing this again, but maybe as a last resort?

Noted. Do you mean the last time you tried tracing recoveries only (as I suggested here) or were you tracing with a broader scope? If you try and trace the transport layers, for instance, then you're going to have a bad time. In any case it's ok not to add logging for now and hope that the stats have some clues.

Ah, that's fair. Last time we enabled TRACE globally. Scoping it down to just recoveries would probably be fine.

Here's a gist with the info you requested:

Thanks, that's helpful. Here are some edited highlights:

    "commit": {
      "user_data": {
        "local_checkpoint": "212661",
        "min_retained_seq_no": "178451",
        "max_seq_no": "212732"
      },
      "num_docs": 47198999
    },
    "seq_no": {
      "max_seq_no": 215715,
      "local_checkpoint": 215711,
      "global_checkpoint": 215579
    },
    "retention_leases": {
      "primary_term": 1,
      "version": 287,
      "leases": [
        {
          "id": "peer_recovery/up6P4q_ARIO1W7Wewt5s7g",
          "retaining_seq_no": 178451,
          "timestamp": 1579646940678,
          "source": "peer recovery"
        },
        ...
      ]
    },

In particular "max_seq_no": 215715 means that this shard only contains 215715 operations. The target of the recovery in question is up6P4q_ARIO1W7Wewt5s7g and the lease for that node has "retaining_seq_no": 178451 indicating that the target is missing 37264 of them. But there is some disagreement: "num_docs": 47198999. I think this means you have a lot of nested documents?

Looking at the recovery stats:

        "translog" : {
          "recovered" : 16507,
          "total" : 2870006,
          "percent" : "0.6%",
          "total_on_start" : -1,
          "total_time_in_millis" : 519729
        },

The total is supposed to be the total number of operations to recover, which should be in the region of 37264, but is instead much larger than the total number of operations in this shard. I suspect we might be calculating the total wrong in the presence of nested documents, maybe counting documents instead of operations, and in fact this recovery need only complete ~37000 operations, so it's around 44% complete. If you let it run for another 10-15 minutes, does it finish? @nhat does that sound like a plausible explanation to you?

Since this replica is missing ~17% of the operations in this shard, Elasticsearch would normally consider it more efficient simply to copy the files over again, but unfortunately, for tricky technical reasons, if you're using nested documents then that calculation is skewed in favour of replaying the translog.

You are absolutely correct that we are using a lot of nested documents in this index.

I generally see them finish somewhere between 30 and 60 minutes.

Ah. That's interesting. It explains why that calculation always seemed so far off.

Is there any way for me to configure the cluster to force it to always favor copying files over (or better yet, just change the math in that calculation)? So far as I can tell, a non-translog recovery is always much faster than a translog one for us. I had a thought of reducing the retention on those logs, but I'm not sure if that would do what I wanted, and I do not know if I would risk data loss by reducing it too far.

There's no config available to influence the ops-vs-files decision directly, but you can indeed achieve a similar effect by reducing index.soft_deletes.retention_lease.period on these heavily-nested indices, especially if you're not using cross-cluster replication. This setting won't cause data loss however it's set. The lower you set it, the shorter the time a node can disconnect and reconnect and catch up without copying everything over. You probably want ops-based recoveries if a node is only offline for a few seconds or minutes, but the default of 12h may be too long for your case.

Addendum (edit): I recommend setting index.soft_deletes.retention_lease.period back to the default when you stop writing to each index, assuming these indices are time-based. Once the index is read-only you should heavily prefer ops-based recoveries since replaying zero missing operations will certainly be faster than copying the data again.
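
For example, something along these lines while the index is still being written to (the 1h value is only illustrative, not a recommendation), then removing the override once it goes read-only:

# while the index is still being written to
PUT /index1/_settings
{
  "index.soft_deletes.retention_lease.period" : "1h"
}

# once the index is read-only, revert to the 12h default
PUT /index1/_settings
{
  "index.soft_deletes.retention_lease.period" : null
}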

@DavidTurner Great find. Your analysis makes a lot of sense to me. Yes, we can make LuceneChangesSnapshot report an accurate number of operations and run faster with nested documents.

I opened https://github.com/elastic/elasticsearch/pull/51279.
