Elasticsearch 6.8.6 pending_tasks grows after 24 hours

CentOS 7.7
Elasticsearch 6.8.6
OpenJDK Runtime Environment (build 1.8.0_242-b08)

    # curl -s localhost:9200/_cluster/health?pretty
    {
      "cluster_name" : "dev-gi-logging",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 190,
      "active_shards" : 190,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 48,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 2049963,
      "active_shards_percent_as_number" : 100.0
    }

Dev single-node cluster
Running in AWS
Dedicated 100 GB EBS gp2 volume

I/O scheduler: [none] mq-deadline kyber (active scheduler shown in brackets)

The node needs to be restarted/reloaded every 24-48 hours.
This only started happening after enabling ILM, which is applied via our index templates.
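Roughly, the templates attach ILM like this; this is just a sketch, and the template/policy/alias names below are placeholders rather than my real ones:

    # curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_template/logs-example' -d '
    {
      "index_patterns": ["logs-example-*"],
      "settings": {
        "index.lifecycle.name": "example-policy",
        "index.lifecycle.rollover_alias": "logs-example"
      }
    }'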

    # iostat 2 6 nvme1n1
    Linux 3.10.0-1062.12.1.el7.x86_64 (ip-10-16-13-67) 	04/02/2020 	_x86_64_	(8 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              33.33    0.02    0.55    0.27    0.00   65.82

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1          65.36        41.98       971.89  189137431 4378439603

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.08    0.00    0.63    0.06    0.00   85.23

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1           0.00         0.00         0.00          0          0

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              16.40    0.00    0.38    0.00    0.00   83.23

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1           0.50         0.00         6.00          0         12

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              14.56    0.00    0.38    0.00    0.00   85.06

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1           0.00         0.00         0.00          0          0

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              16.59    0.00    0.88    0.00    0.00   82.53

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1           0.00         0.00         0.00          0          0

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              15.70    0.00    0.50    0.00    0.00   83.80

    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    nvme1n1           0.00         0.00         0.00          0          0

You shared your hot threads in another channel:

    ::: {ip-10-16-13-67-master}{MStdCN0_TSihH2muPN7NcQ}{zDNK_-UIQhaJy6tLBwCAqg}{10.240.53.104}{10.240.53.104:9300}{ml.machine_memory=32905895936, node_type=hot, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
       Hot threads at 2020-04-02T18:11:54.911Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
       86.6% (433ms out of 500ms) cpu usage by thread 'elasticsearch[ip-10-16-13-67-master][clusterApplierService#updateTask][T#1]'
         9/10 snapshots sharing following 21 elements
           org.apache.logging.log4j.core.Logger$PrivateConfig.<init>(Logger.java:383)
           org.apache.logging.log4j.core.Logger.updateConfiguration(Logger.java:365)
           org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:654)
           org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:643)
           org.apache.logging.log4j.core.config.Configurator.setLevel(Configurator.java:296)
           org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:129)
           org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
           org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:163)
           org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:121)
           org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:609)
           org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater$$Lambda$4307/2017848792.run(Unknown Source)
           org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:190)
           org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:461)
           org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:419)
           org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:163)
           org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
           org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
           org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
           java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           java.lang.Thread.run(Thread.java:748)
         unique snapshot
           org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:655)
           org.apache.logging.log4j.core.LoggerContext.updateLoggers(LoggerContext.java:643)
           org.apache.logging.log4j.core.config.Configurator.setLevel(Configurator.java:296)
           org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:129)
           org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
           org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:163)
           org.elasticsearch.common.settings.ClusterSettings$LoggingSettingUpdater.apply(ClusterSettings.java:121)
           org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:609)
           org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater$$Lambda$4307/2017848792.run(Unknown Source)
           org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:190)
           org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:461)
           org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:419)
           org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:163)
           org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
           org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
           org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
           java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           java.lang.Thread.run(Thread.java:748)

Your master is spending all its time configuring loggers? I don't think I've ever seen that before. Do you have an unusually complicated logging configuration?

Nothing complex. One thing I did notice was a filebeat index that was repeatedly trying to be created and failing because of its "doc" type, which appeared to be caused by an old filebeat* template. After removing that template the index was created fine. I'm wondering if that was the underlying issue.
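For reference, this is roughly what I did; the template name below is a placeholder for the old one I removed:

    # curl -s 'localhost:9200/_cat/templates/filebeat*?v&h=name,index_patterns,order'
    # curl -s -XDELETE 'localhost:9200/_template/<old-filebeat-template>'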
I also mentioned in that channel that I'm seeing a lot of "authorization is not allowed" messages. I know that when not at DEBUG these aren't actually being logged, but that doesn't mean the context switching isn't still happening. How do you stop this "noise" when using the FOSS model?

Oh yes that would explain some things:

    # curl -s localhost:9200/_cat/pending_tasks?h=source | sort | uniq -c
          2 create-index [filebeat-6.6.2-2020.04.02], cause [auto(bulk api)]
        691 ilm-execute-cluster-state-steps
        216 ilm-set-step-info

create-index runs at priority URGENT whereas mapping updates are HIGH and the ILM tasks are NORMAL, so if you spam your cluster hard enough with create-index requests (including bulks that need to create indices) then that can cause starvation for other tasks.
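For what it's worth, the _cat API can also show each queued task's priority next to its source, along these lines:

    # curl -s 'localhost:9200/_cat/pending_tasks?h=priority,source' | sort | uniq -c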

I'm still surprised that we saw the master spending 500+ms updating loggers. Looks like you've set logger._root dynamically -- might be worth removing that setting.
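For example, something along these lines should show whether it is set and then clear it (setting a logger to null resets it to its default):

    # curl -s 'localhost:9200/_cluster/settings?flat_settings=true&pretty'
    # curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
    {
      "transient":  { "logger._root": null },
      "persistent": { "logger._root": null }
    }'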

The message "not opting out of the query cache; authorization is not allowed" doesn't appear in the OSS distribution. I don't know what context-switching you are asking about.

Yes, when debug logging for _root was enabled I was flooded with these messages, and the entire setup is OSS, so I'm not sure how it "doesn't appear in OSS"?

    [2020-04-02T13:51:46,726][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:51:46,726][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_task_manager] not opting out of the query cache; authorization is not allowed
    2019.05.05] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:51:46,727][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:51:46,727][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.reporting-2019.05.12] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.index][106] took [573micros]
    [2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.state][3] took [163.4micros]
    [2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [shard.node][1] took [70.5micros]
    [2020-04-02T13:54:25,558][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [source_node.name][1] took [77.2micros]
    [2020-04-02T13:54:25,559][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] [ip-10-16-13-67-master] global-ordinals [source_node.uuid][1] took [76.7micros]
    [2020-04-02T13:54:25,559][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.monitoring-es-6-2020.04.02] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed
    [2020-04-02T13:54:23,612][DEBUG][o.e.x.s.a.a.OptOutQueryCache] [ip-10-16-13-67-master] [.kibana_2] not opting out of the query cache; authorization is not allowed

The only place this message appears in the Elasticsearch codebase is in the X-Pack security OptOutQueryCache class, which is not included in the OSS distribution.

What does GET / return on this node?

    {
      "name" : "ip-10-16-13-67-master",
      "cluster_name" : "dev-gi-logging",
      "cluster_uuid" : "zgBS2UDcTh2BY4mMKsE0Lg",
      "version" : {
        "number" : "6.8.7",
        "build_flavor" : "default",
        "build_type" : "rpm",
        "build_hash" : "c63e621",
        "build_date" : "2020-02-26T14:38:01.193138Z",
        "build_snapshot" : false,
        "lucene_version" : "7.7.2",
        "minimum_wire_compatibility_version" : "5.6.0",
        "minimum_index_compatibility_version" : "5.0.0"
      },
      "tagline" : "You Know, for Search"
    }

This is the default distribution, not the OSS distribution. The default distribution includes all the free features, including ones that are not purely OSS.

Coming back to the question of context switching, what exactly is your concern there? It sounds like you are worried about the cost of a DEBUG-level log statement when DEBUG-level logging is disabled. I don't think that involves any context switching; it should be very cheap.

Not sure where the OSS package is then:

    # rpm -q elasticsearch
    elasticsearch-6.8.7-1.noarch

    # cat /etc/yum.repos.d/elastic.repo
    [elastic]
    name=Elastic package repository.
    baseurl=https://artifacts.elastic.co/packages/6.x/yum
    enabled=1
    gpgcheck=1
    gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch

Yes, I'm leveraging the free features of X-Pack, but shouldn't there be something to enable in elasticsearch.yml to state that I'm using the "free" license and bypass any of this code?

The sheer volume of these messages that gets spat out when DEBUG is enabled is what I am concerned about. It has to add up. Turning DEBUG off doesn't mean the code isn't run; it just means the logger.debug("not opting out of the query cache; authorization is not allowed"); line isn't executed, while licenseState.isAuthAllowed() is still run unnecessarily.

If I didn't see it so frequently in the log for the .kibana_2, .reporting, etc. indices then I would agree it should be cheap, but at that frequency surely it has to have some performance hit?

The only -oss package I see in that repository is kibana-oss; there are none for elasticsearch, logstash, or any others.

According to https://www.elastic.co/guide/en/elasticsearch/reference/6.8/rpm.html if you really only want the OSS distribution then you should use this:

    baseurl=https://artifacts.elastic.co/packages/oss-6.x/yum
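For example, the repo file would look something like this, mirroring your existing one (the file and section names below are up to you); the OSS package is then installed as elasticsearch-oss rather than elasticsearch:

    # cat /etc/yum.repos.d/elastic-oss.repo
    [elastic-oss]
    name=Elastic OSS package repository.
    baseurl=https://artifacts.elastic.co/packages/oss-6.x/yum
    enabled=1
    gpgcheck=1
    gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch

    # yum install elasticsearch-oss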

Unless you have evidence from a profiler that we're spending an excessive amount of time in licenseState.isAuthAllowed(), I don't think this is cause for concern, no. This code is part of the security functionality, which is included in the free Basic license and enabled by default, which is why it's being called.

It looks like you can completely disable security by setting xpack.security.enabled: false, which should make this logging go away, but I don't expect that to have much impact on anything.
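That is, something like this in elasticsearch.yml; just a sketch:

    # in /etc/elasticsearch/elasticsearch.yml (a node restart is needed for this to take effect)
    xpack.security.enabled: false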

Setting logger._root: DEBUG is rarely useful and not recommended. DEBUG-level logging is often only meaningful when read alongside the code itself, and even then you normally want to focus on a small number of packages. There's no reason to expect this level of logging to have a manageable volume or any actionable content.
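If you do need DEBUG logging, it's better to scope it to a specific package and reset it afterwards, for instance (the package here is just an example):

    # curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
    {
      "transient": { "logger.org.elasticsearch.cluster.service": "DEBUG" }
    }'

    # ...and reset it again once you're done:
    # curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '
    {
      "transient": { "logger.org.elasticsearch.cluster.service": null }
    }'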

Thank you!
If I switch to the OSS packages I lose the X-Pack monitoring, correct?
I will look at the xpack.security.enabled: false setting to see if that helps, but to your point, my original issue was probably the filebeat* template and the URGENT create-index tasks causing the problem.

Correct.
