::: {ip-10-16-13-67-master}{MStdCN0_TSihH2muPN7NcQ}{zDNK_-UIQhaJy6tLBwCAqg}{10.240.53.104}{10.240.53.104:9300}{ml.machine_memory=32905895936, node_type=hot, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
Hot threads at 2020-04-02T18:11:54.911Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
86.6% (433ms out of 500ms) cpu usage by thread 'elasticsearch[ip-10-16-13-67-master][clusterApplierService#updateTask][T#1]'
9/10 snapshots sharing following 21 elements
org.apache.logging.log4j.core.Logger$PrivateConfig.<init>(Logger.java:383)
org.apache.logging.log4j.core.Logger.updateConfiguration(Logger.java:365)
org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:654)
org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:643)
org.apache.logging.log4j.core.config.Configurator.setLevel(Configurator.java:296)
org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:129)
org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:163)
org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:121)
org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:609)
org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater$$Lambda$4307/2017848792.run(Unknown Source)
org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:190)
org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:461)
org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:419)
org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:163)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
unique snapshot
org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:655)
org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:643)
org.apache.logging.log4j.core.config.Configurator.setLevel(Configurator.java:296)
org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:129)
org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:163)
org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:121)
org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:609)
org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater$$Lambda$4307/2017848792.run(Unknown Source)
org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:190)
org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:461)
org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:419)
org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:163)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Your master is spending all its time configuring loggers? I don't think I've ever seen that before. Do you have an unusually complicated logging configuration?
Nothing complex. One thing I did notice was a filebeat index that was repeatedly trying to be created and failing due to a "doc" type, which appeared to be caused by an old filebeat* template. After removing that template the index was created fine. I'm wondering if that was the underlying issue.
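For reference, I inspected and removed the old template with roughly the following; the exact template name will vary:

GET _template/filebeat*

DELETE _template/filebeat-<old-version>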
I also mentioned in that channel that I'm seeing a lot of "authorization is not allowed" messages. Now, I know that without DEBUG these aren't actually being logged, but that doesn't mean the context switching isn't still happening. How do you stop this "noise" when using the FOSS model?
create-index runs at priority URGENT whereas mapping updates are HIGH and the ILM tasks are NORMAL, so if you spam your cluster hard enough with create-index requests (including bulks that need to create indices) then that can cause starvation for other tasks.
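If you want to see what's queued up on the master, including each task's priority, the pending tasks API shows it; the entry below is just illustrative:

GET _cluster/pending_tasks

{
  "tasks": [
    {
      "insert_order": 101,
      "priority": "URGENT",
      "source": "create-index [filebeat-2020.04.02], cause [auto(bulk api)]",
      "time_in_queue_millis": 86,
      "time_in_queue": "86ms"
    }
  ]
}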
I'm still surprised that we saw the master spending most of a 500ms sampling interval updating loggers. It looks like you've set logger._root dynamically -- it might be worth removing that setting.
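If it was applied as a transient cluster setting, clearing it should look something like this (use "persistent" instead if that's where it's set):

PUT _cluster/settings
{
  "transient": {
    "logger._root": null
  }
}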
The message "not opting out of the query cache; authorization is not allowed" doesn't appear in the OSS distribution. I don't know what context-switching you are asking about.
Yes, when debug logging for _root was enabled I was flooded with these messages, and the entire setup is OSS, so I'm not sure how it "doesn't appear in OSS"?
[2020-04-02T13:51:46,726][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:51:46,726][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_task_manager] not opting out of the query cache; authorization is not allowed
2019.05.05] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:51:46,727][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:51:46,727][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.index][106] took [573micros]
[2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.state][3] took [163.4micros]
[2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.node][1] took [70.5micros]
[2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [source_node.name][1] took [77.2micros]
[2020-04-02T13:54:25,559][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [source_node.uuid][1] took [76.7micros]
[2020-04-02T13:54:25,559][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.monitoring-es-6-2020.04.02] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
[2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
This is the default distribution, not the OSS distribution. The default distribution includes all the free features including those that are not purely OSS.
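If you want to double-check which distribution a node is running, the root endpoint reports the build flavour; the version number below is just a placeholder:

GET /

{
  "version": {
    "number": "6.8.0",
    "build_flavor": "default",
    ...
  },
  ...
}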
Coming back to the question of context switching, what exactly is your concern there? It sounds like you are worried about the cost of a DEBUG-level log when DEBUG-level logging is disabled. I don't think that involves any context-switching, it should be very cheap.
Yes, I'm leveraging the free features of X-Pack, but shouldn't there be something to enable in elasticsearch.yml to state that I'm using the "free" license and circumvent any of this code?
The sheer volume of these messages that spits out when DEBUG is enabled is what I'm concerned about. It has to add up. Disabling DEBUG doesn't mean the code isn't run; it just means the line logger.debug("not opting out of the query cache; authorization is not allowed"); isn't executed, while licenseState.isAuthAllowed() is still run unnecessarily.
If I didn't see it so frequently in the log for the .kibana_2, .reporting, etc. indices I would agree it should be cheap, but at that frequency it has to have some performance hit?
Unless you have evidence from a profiler that we're spending an excessive amount of time in licenseState.isAuthAllowed(), I don't think this is cause for concern, no. This code is part of the security functionality, which is included in the free basic license and enabled by default, which is why it's being called.
It looks like you can completely disable security by setting xpack.security.enabled: false which should make this logging go away, but I don't expect that to have much impact on anything.
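It's a static setting, so as far as I know it needs to go into elasticsearch.yml on every node and only takes effect after a restart:

# elasticsearch.yml (every node): disable the security module entirely
xpack.security.enabled: false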
Setting logger._root: DEBUG is rarely useful and not recommended. DEBUG-level logging is often only meaningful when read alongside the code itself, and even then you normally want to focus on a small number of packages. There's no reason to expect this level of logging to have a manageable volume or any actionable content.
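If you do need DEBUG logs for a specific investigation, it's better to enable them for a narrow package and remove the setting again afterwards. For example (the package name here is only illustrative):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}

and reset it once you're done:

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": null
  }
}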
Thank You!
If I switch to the OSS packages I lose the X-Pack monitoring, correct?
I will look at the xpack.security.enabled: false setting to see if that helps, but to your point my original issue was probably the filebeat* template and the URGENT create-index tasks causing issues.