Issue with thread_pool.write.queue_size

Hi

We have been experiencing issues on all nodes of our ES cluster over the last few days; we see lots of messages like these:

[2024-03-05T01:07:37,648][WARN ][o.e.x.m.e.l.LocalExporter] [elastic-loggerpro05] unexpected error while indexing monitoring document

org.elasticsearch.xpack.monitoring.exporter.ExportException: RemoteTransportException[[elastic-loggerpro01][10.57.40.167:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[elastic-loggerpro01][10.57.40.167:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportWriteAction$1/WrappedActionListener{org.elasticsearch.action.support.replication.ReplicationOperation$$Lambda$8539/0x0000000801d02d78@4306d4fb}{org.elasticsearch.action.support.replication.ReplicationOperation$$Lambda$8542/0x0000000801d033a8@3dbfcfc8} on EsThreadPoolExecutor[name = elastic-loggerpro01/write, queue capacity = 10000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@42a74d90[Running, pool size = 80, active threads = 80, queued tasks = 10014, completed tasks = 23396146]]];

They happen between 01:00 and 02:00 approximately, when some of the new indices are created (around 1,000 indices per day) from data sent by td-agent.
Those indices are not created when these errors appear.

We updated the setting:

thread_pool.write.queue_size

on all nodes, setting it to 50,000. I also started monitoring the write queue size on all nodes, using this query:

/_cat/thread_pool?v&h=node_name,name,max,pool_size,size,type,queue_size,active,queue
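
For reference, a minimal sketch of how that static node setting is typically applied (assuming the default deb package layout; it is not a dynamic setting, so each node needs a restart after the change):

# /etc/elasticsearch/elasticsearch.yml, on every node
# the 7.x default is 10000, which matches the "queue capacity = 10000" in the rejection above
thread_pool.write.queue_size: 50000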

After the parameter change, queue values increased to 20,000 on some nodes, and on one node the queue reached the limit (50,000).

This issue hadn't happened before last week (though I can't say anything about the queue size back then, as I wasn't monitoring it).

Last night, all queues were low (maximum 2000).

We don't know what's causing this issue. I'm monitoring CPU, load, memory and network usage, and everything seems OK (no increases over the last week).
The only change was a network switch upgrade, but I'm not sure whether that's just a coincidence.
Do you know the reasons for this queue increase?

Our servers:
5 ES nodes, version 7.17.5
128 GB RAM per node
64 CPUs per node, Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz

Thanks

What is the full output of the cluster stats API?
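
For anyone following along, that output comes from the cluster stats endpoint, e.g.:

GET /_cluster/stats?human&pretty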

It sounds like you may have a very large number of very small indices, which is generally very, very inefficient and suboptimal. Why do you create so many indices every day?

Elasticsearch is often limited by storage performance, so knowing what type of storage you are using is critical. Creating a large number of new indices will result in a lot of changes to the cluster state, which need to be persisted to disk across the cluster, so storage performance matters for this as well. Are you using local SSDs?

cluster stats api:

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "elastic-loggerpro-cluster",
  "cluster_uuid" : "GGwhiCgqSdGvZADnEiIb_w",
  "timestamp" : 1709814575017,
  "status" : "green",
  "indices" : {
    "count" : 12606,
    "shards" : {
      "total" : 25677,
      "primaries" : 12606,
      "replication" : 1.0368871965730604,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 3,
          "avg" : 2.0368871965730606
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.0368871965730604
        }
      }
    },
    "docs" : {
      "count" : 672939548,
      "deleted" : 9595435
    },
    "store" : {
      "size_in_bytes" : 453054182328,
      "total_data_set_size_in_bytes" : 453054182328,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 7583848,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 2271639957,
      "total_count" : 26936327,
      "hit_count" : 1853459,
      "miss_count" : 25082868,
      "cache_size" : 62847,
      "cache_count" : 122819,
      "evictions" : 59972
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 144264,
      "memory_in_bytes" : 757508310,
      "terms_memory_in_bytes" : 584512704,
      "stored_fields_memory_in_bytes" : 72147904,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 54678144,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 46169558,
      "index_writer_memory_in_bytes" : 67620232,
      "version_map_memory_in_bytes" : 1765330,
      "fixed_bit_set_memory_in_bytes" : 110352152,
      "max_unsafe_auto_id_timestamp" : 1709813872465,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 2327,
          "index_count" : 27,
          "script_count" : 0
        },
        {
          "name" : "binary",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 605,
          "index_count" : 87,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 10,
          "index_count" : 4,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 13825,
          "index_count" : 12592,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 3,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 3,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "double_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 200,
          "index_count" : 25,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 334,
          "index_count" : 39,
          "script_count" : 0
        },
        {
          "name" : "float_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 206,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 58,
          "index_count" : 16,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 6603,
          "index_count" : 6403,
          "script_count" : 0
        },
        {
          "name" : "integer_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 5989,
          "index_count" : 5663,
          "script_count" : 0
        },
        {
          "name" : "ip_range",
          "count" : 3,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 194265,
          "index_count" : 12595,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 4070,
          "index_count" : 119,
          "script_count" : 0
        },
        {
          "name" : "long_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 250,
          "index_count" : 36,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 7552,
          "index_count" : 592,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 25,
          "index_count" : 25,
          "script_count" : 0
        },
        {
          "name" : "shape",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 91064,
          "index_count" : 12583,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 4,
          "index_count" : 4,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.17.5",
        "index_count" : 12606,
        "primary_shard_count" : 12606,
        "total_primary_bytes" : 226518517486
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "coordinating_only" : 0,
      "data" : 5,
      "data_cold" : 5,
      "data_content" : 5,
      "data_frozen" : 5,
      "data_hot" : 5,
      "data_warm" : 5,
      "ingest" : 5,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 5,
      "transform" : 5,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.5"
    ],
    "os" : {
      "available_processors" : 368,
      "allocated_processors" : 368,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Debian GNU/Linux 11 (bullseye)",
          "count" : 5
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 5
        }
      ],
      "mem" : {
        "total_in_bytes" : 877663375360,
        "free_in_bytes" : 226508640256,
        "used_in_bytes" : 651154735104,
        "free_percent" : 26,
        "used_percent" : 74
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 20
      },
      "open_file_descriptors" : {
        "min" : 41660,
        "max" : 42355,
        "avg" : 42041
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 180836906,
      "versions" : [
        {
          "version" : "18.0.1.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "18.0.1.1+2-6",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 49540415888,
        "heap_max_in_bytes" : 128849018880
      },
      "threads" : 2781
    },
    "fs" : {
      "total_in_bytes" : 22941210345472,
      "free_in_bytes" : 22469867352064,
      "available_in_bytes" : 21309850382336
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 5
      },
      "http_types" : {
        "security4" : 5
      }
    },
    "discovery_types" : {
      "zen" : 5
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 5
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 22,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set_security_user" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

Why do you create so many indices every day?

Well, it's a customer requirement due to the software used to write and read those indices. The average index size is 10 MB.

Are you using local SSDs?

I'm not sure about that; let me ask the people in the server department.

But just for your information: this cluster has been running for 3 years without any failure. No changes, no load increase, nothing (except the switch changes, which I don't think are the cause of the problem, and there are people checking them already).

There is a limit of 1,000 shards per node, which is there for a very good reason. You have overridden this and have over 5,000 shards per node. This is not good.
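
The limit referred to here is the cluster.max_shards_per_node setting, which defaults to 1000; with the 25,677 shards and 5 data nodes shown in the stats above, this cluster sits at roughly 5,100 shards per node. A quick way to check whether the default has been overridden is something like:

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node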

I would recommend changing this. I am not familiar with td-agent, so I do not know what flexibility there is, but I would generally recommend consolidating indices and only splitting them based on the type of data and/or retention period, not on server or service names etc. This could potentially reduce the number of indices significantly and improve the stability of the system.

If the indexing process can be configured to index into a single index without a date pattern, you could also create data streams and index into these. With data streams you index into an alias, which looks like a single index but is backed by multiple time-based indices behind the scenes. These roll over based on size and/or age, which means that new indices are created throughout the day rather than all at the same time.
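
A rough sketch of what that could look like, assuming one data stream per type of data (the policy, template and data stream names below are made-up examples, and the rollover and retention thresholds are only illustrative):

# ILM policy: roll the backing index over on size or age, delete after 30 days
PUT _ilm/policy/td-agent-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

# Index template: any index matching the pattern becomes a data stream using the policy above
PUT _index_template/td-agent-logs-template
{
  "index_patterns": ["td-agent-logs-*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "td-agent-logs-policy" }
  }
}

# Writes then go to the data stream name; documents must contain an @timestamp field
POST td-agent-logs-app1/_doc
{
  "@timestamp": "2024-03-05T01:07:37.648Z",
  "message": "example log line"
}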

I have seen a few clusters with very large shard counts over the years, and they all seem to work well up to a point where they suddenly don't. Often the shard count has been slowly increasing over time until the cluster reached a tipping point. At that point the number of shards sometimes makes it very hard or even impossible to rectify the problems, and I have seen clusters where all or a majority of the data was lost.

It is also worth noting that improvements to how Elasticsearch handles very large shard counts have been made in version 8.x, so I would recommend you upgrade to the latest version.


Hi

Thanks for all the info.

About storage: 3 nodes have local SAS SSD storage in RAID 5.
The other 2 nodes are virtual machines (VMware) with iSCSI remote disks.

Cheers

Making lots of changes to the cluster state means the cluster state will need to be updated frequently, which will result in a lot of writes. The slowest node will determine the pace, and while this happens, data being written to the indices being created will need to be queued up. I would recommend having the same storage on all nodes, as Elasticsearch by default assumes all nodes are equal.
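
If you want to see that backlog of cluster state updates building up during the 01:00-02:00 window, the pending cluster tasks endpoints will show it (these are standard APIs, nothing specific to your setup):

GET _cluster/pending_tasks
GET _cat/pending_tasks?v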

Thanks again.

I have a retention of 30 days for all my indices. If I close indices older than 7 days, for example, would that improve performance and maybe solve my issues?

If I need an index older than 7 days, I only need to reopen it first.

Cheers

Closing indices would only potentially reduce the amount of heap used, so it would not help here.

Ok
What if I add nodes to the cluster? Then the number of shards per node will decrease...

I do not think the issue in itself is the number of shards per node, even though this is very high. Based on your description, the issue seems to be that you are creating a very large number of new indices in a very short period of time. This will not be solved by increasing the node count. As far as I can tell there are two ways (not necessarily exclusive) to resolve the issue, both of which rely on changing how you index data.

  1. Spread out index creation over time by adopting rollover and allowing indices to cover a more flexible time period. This would also allow indices to cover longer time periods and reduce the shard count.
  2. Consolidate indices so there are fewer indices that need to be created per day. This will also reduce the overall shard count, which is recommended (see the sketch after this list).
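
For the consolidation option, existing small indices can also be merged after the fact with the reindex API. A rough sketch, with made-up index names (going forward, consolidation would mainly mean pointing td-agent at fewer index names per day):

# Merge many small per-service daily indices into one daily index
POST _reindex
{
  "source": { "index": "service-*-2024.03.05" },
  "dest": { "index": "logs-2024.03.05" }
}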

Apart from that you can try to limit the symptoms by ensuring that storage performance is not a limiting factor at the times the indices are being created.

I could try pre-creating those indices (via a script) hours in advance, with some delay between each creation. That's how I did it in the past, when I had Elasticsearch version 1.6.
Would that solve the issue?

Yes, spreading out the index creation load that way could work. I would still recommend you change how you are sharding the data, though. This approach just puts a band-aid on it, and the next time you encounter issues due to the large shard count you may not be able to solve them and could potentially lose data.

Thanks. Yes, I understand, but we first need to take the easy steps and then think about how to deal with the "big issue".

Cheers
