Management threads blocked on TranslogWriter.syncUpTo

Hi, everyone.

I'm currently struggling with a weird issue after upgrading Elasticsearch from 5.6.4 to 6.0.0 GA.
I did a rolling upgrade and I'm now seeing threads from the management pool almost constantly blocked on org.elasticsearch.index.translog.TranslogWriter.syncUpTo:

Stack Trace
elasticsearch[elastic-fk-node01][management][T#1] [86] (BLOCKED)
   org.elasticsearch.index.translog.TranslogWriter.syncUpTo line: 328 
   org.elasticsearch.index.translog.TranslogWriter.sync line: 238 
   org.elasticsearch.index.translog.Translog.sync line: 670 
   org.elasticsearch.index.seqno.GlobalCheckpointSyncAction.shardOperationOnPrimary line: 119 
   org.elasticsearch.index.seqno.GlobalCheckpointSyncAction.shardOperationOnPrimary line: 51 
   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform line: 1033 
   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform line: 1011 
   org.elasticsearch.action.support.replication.ReplicationOperation.execute line: 104 
   org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse line: 358 
   org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse line: 298 
   org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse line: 974 
   org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse line: 971 
   org.elasticsearch.index.shard.IndexShardOperationPermits.acquire line: 238 
   org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit line: 2188 
   org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference line: 983 
   org.elasticsearch.action.support.replication.TransportReplicationAction.access$500 line: 97 
   org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun line: 319 
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run line: 37 
   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived line: 294 
   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived line: 281 
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived line: 66 
   org.elasticsearch.transport.TransportService$7.doRun line: 659 
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun line: 638 
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run line: 37 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1149 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 624 
   java.lang.Thread.run line: 748 

I've attached several screenshots from JMC that demonstrate the problem.

Also, here is a full call tree for a file write to one of the translog files: https://pastebin.com/raw/56ZSkB3u.

Could this be somehow connected to the new sequence number system in ES 6.0.0?
I've never had this issue before upgrading to 6.0.0.
This situation is pretty bad, because I cannot even query "_nodes/stats" - the request just times out.
I'm also seeing higher than usual CPU usage (compared to other Sundays) and the master node is constantly complaining about timed out requests:

[2017-12-03T09:12:33,564][WARN ][o.e.t.TransportService ] [elastic-fk-node01] Received response for a request that has timed out, sent [18278ms] ago, timed out [3253ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elastic-fk-node03}{ZwRuJpSMSNCgtMezxDVoPw}{Y_f-tf_cT-2IiLJJGH2h2w}{x.x.x.x}{x.x.x.x:9300}], id [50088016]

Here are some details about our cluster:

Elasticsearch version: 6.0.0 GA
Nodes: 3 (master+data+coordinating)
Heap size: 31GB
Java version: 1.8.0_144
OS: CentOS Linux release 7.4.1708
Kernel: 4.13.3-1.el7.elrepo.x86_64
Elasticsearch VM specs: 16 vCPUs, 64GB RAM

Our normal indexing speed during working hours is about 12000 docs per second with spikes up to 20000-22000 dps.
Right now I'm seeing around 5000-6000 docs per second, so the load on the cluster is not an issue.
There are currently 6,750,732,469 documents (6.4TB including replicas), 260 indices, and 1,736 shards in the cluster.
Most indices have 3 shards and 1 replica.

Each index uses the following translog settings:

"translog" : {
  "flush_threshold_size" : "512mb",
  "sync_interval" : "15s",
  "durability" : "async"
},
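(For context, one way to apply these settings to every new daily index is through an index template; a minimal sketch, where the template name and index pattern are just illustrative:)

PUT _template/translog-example
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "index.translog.flush_threshold_size": "512mb",
    "index.translog.sync_interval": "15s",
    "index.translog.durability": "async"
  }
}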

The VMs are performing fine, CPU utilization is around 30% and there is plenty of available memory.
We've already tried lowering the flush threshold to 64mb, which didn't help at all.
I'd greatly appreciate any advice or guidance in this situation.

So I've decided to stop all indexing and do a full cluster restart as a last resort (the corresponding REST calls are sketched after the list):

  1. Prevented any queries from reaching the cluster (flush and management threads continued to work pretty hard after this point).
  2. Set index.translog.retention.size to 0.
  3. Force merged and flushed all indices.
  4. Restarted all ES nodes.
  5. Set index.translog.retention.size back to 512mb.
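
Roughly, steps 2, 3 and 5 translate into the following REST calls (a sketch; _all and max_num_segments=1 are just what I'd use here for brevity):

PUT _all/_settings
{ "index.translog.retention.size": "0b" }

POST _all/_forcemerge?max_num_segments=1
POST _all/_flush

# after the full cluster restart
PUT _all/_settings
{ "index.translog.retention.size": "512mb" }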

The cluster seems to be working fine now, although it might be too early to say.

Three hours after the restart, the issues are back again.
It appears to get worse over time.

How many primary shards are actively seeing writes?

51 primary shards are being actively written to at any given point in time (some more than others).
We're using the Elastic Stack to collect logs and some metrics. There are several indices (one for each log "type"), and they all use the default Logstash naming scheme (a new index is created every day).
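
(A rough way to arrive at that number, given the daily indices, is to list today's shards and count the primaries; the date below is just an example:)

GET _cat/shards/*-2017.12.04?v&h=index,shard,prirep,node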

If this is of any relevance, I was also hit by issue #27536, but I fixed it by disabling replicas, force merging the indices that had this problem, and re-enabling replicas afterwards.
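
(For reference, that workaround boils down to roughly the following per affected index; the index name is illustrative:)

PUT logstash-2017.11.30/_settings
{ "index.number_of_replicas": 0 }

POST logstash-2017.11.30/_forcemerge?max_num_segments=1

PUT logstash-2017.11.30/_settings
{ "index.number_of_replicas": 1 }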

Would you please enable trace logging on org.elasticsearch.index.shard.IndexShard, capture a flight recording during indexing that reflects the contention again, and share the output of the org.elasticsearch.index.shard.IndexShard logger from the primaries, captured while the flight recording was running? Also, can we have the output of /_cluster/state/routing_table?
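
(The logger can be toggled dynamically through the cluster settings API; a minimal sketch, to be reset by setting the logger back to null afterwards:)

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.index.shard.IndexShard": "trace"
  }
}

GET _cluster/state/routing_table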

To be clear: a recording (as you have done above) and logs from a single node is sufficient.

Sure, here you go: http://jmp.sh/v/vREuVzOuwJOtJSGwv5lS.
Note that there is one index with replica shards in the initializing state. This is because I've recently enabled replicas for that index after having disabled them yesterday, so this is not a sign of cluster instability.

Meanwhile, I'll collect the trace log you requested and a new JFR.

I've sent you a PM with links to the flight recording and logs from the same node.
Hope this helps.

I've collected and sent you another set of files (from a different node), just in case the first one is not enough.

Thanks. This helps. I understand the problems here and will open a PR soon.

For now, I think that you should increase the size of the management thread pool (maybe equal to the number of cores that you have). This will at least reduce the likelihood that a cluster stats request times out because it could not get time on the management thread pool.
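
(Assuming the management pool's scaling max setting is the right knob, this would be a static node setting in elasticsearch.yml, picked up after a node restart; a sketch for a 16 vCPU node:)

# elasticsearch.yml
thread_pool.management.max: 16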

I've increased the size of the management pool to the number of online cores, but I don't think it helped at all.
What's worse than not being able to query some of the cat and cluster API endpoints is that Kibana's Elasticsearch plugin goes red because it cannot complete its requests to _nodes.

{"type":"log","@timestamp":"2017-12-04T03:03:54Z","tags":["error","elasticsearch","admin"],"pid":16471,"message":"Request error, retrying\nGET http://elk-balancer:9201/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip => socket hang up"}
{"type":"log","@timestamp":"2017-12-04T03:04:24Z","tags":["error","elasticsearch","admin"],"pid":16471,"message":"Request error, retrying\nGET http://elk-balancer:9201/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip => socket hang up"}
{"type":"log","@timestamp":"2017-12-04T03:04:54Z","tags":["error","elasticsearch","admin"],"pid":16471,"message":"Request complete with error\nGET http://elk-balancer:9201/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip => socket hang up"}

And I don't think there is much that can be done about this right now.
I'll see if I can maybe fake the responses to that request at the Nginx reverse proxy between Kibana and Elasticsearch.

So I've implemented a workaround for those requests in Nginx, using custom status codes and error_page to route the matching requests to named locations that return canned responses:

...
# Named locations returning canned JSON responses
location @nodes_fake {
        default_type application/json;
        return 200 '<original _nodes JSON response here>';
}

location @nodes_tribes_fake {
        default_type application/json;
        return 200 '{}';
}
...
# Internal-only status codes mapped to the named locations above
error_page 555 = @nodes_fake;
error_page 556 = @nodes_tribes_fake;
...
# Intercept the _nodes requests that Kibana issues
if ( $query_string ~ "filter_path=nodes.+http.publish_address" ) { return 555; }
if ( $query_string ~ "filter_path=nodes.+settings.tribe" ) { return 556; }
...

At least Kibana works now.

The management thread pools are not looking good, though. This is from the master node right now:
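
(The same numbers can also be pulled in text form with the cat API when the request manages to get through; request only, for reference:)

GET _cat/thread_pool/management?v&h=node_name,name,active,queue,rejected,size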

But I'm happy that we can at least query the cluster and the indexing/query performance seems to be unaffected.

Thank you for helping me troubleshoot this, we'll wait for that fix, I guess. :slightly_smiling_face:

An update: the cluster is now under its regular load and the performance hit has become noticeable.
Kibana mostly works with the workaround from my previous comment, except for TSVB: it fails to load because some of its requests time out and I do not see a way to fake the responses, since it needs actual data from the cluster.

To start I have opened:

We will get these out in a release soon; let's see how much they improve the situation before deciding whether we need to take it further.

These changes are available in Elasticsearch 6.1.0 now. Would you please try this release and let us know if the situation is improved?

We've updated our stack to 6.1.0 and the situation is greatly improved. I see no contention on the management thread pool now:

Our cluster now runs without any problems. :relieved:

Thank you for looking into this and fixing the issue!

You’re welcome, thanks for the report, the diagnostics, and confirming the fix.
