Occasional Index Management timeout

Occasionally (generally at least once per day) when attempting to access data in Kibana, I'll get a strange series of timeouts. I usually notice it first because the Discover view shows "Error: Could not locate that index-pattern-field (id: Timestamp)". Concurrently, trying to access the list of indices in Kibana Stack Management times out ("Error Loading Indices, Request timeout after 30000ms"), and REST calls to list indices, nodes, and shards all hang (though cluster health responds immediately). Stack Monitoring (on an independent monitoring cluster) shows Kafka getting backed up, indicating that ingestion is slowing. Monitoring data from my ES data node also doesn't seem to be reported during this time, though there's nothing out of the ordinary on the active master node. Things generally clear up after several minutes (sometimes upwards of 15-20), and usually only once my Kafka consumer group lag drops back down to 0, which makes me think something about ingestion is hogging all of the resources. Other times, however, I'll see Kafka get a bit backed up for a minute with no adverse effect on new requests.
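For concreteness, the calls that hang during one of these episodes look like this (hitting the coordinating node; the host and port below are placeholders for my setup):

curl -s 'http://localhost:9200/_cat/indices?v'          # hangs during an episode
curl -s 'http://localhost:9200/_cat/nodes?v'            # hangs
curl -s 'http://localhost:9200/_cat/shards?v'           # hangs
curl -s 'http://localhost:9200/_cluster/health?pretty'  # still responds immediately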

My cluster layout is one data node, three master nodes, and a coordinating node that talks to Kibana. I've got about 600GB of data and ~200 shards. I've tried resizing the cluster, adding and removing data nodes, trying different master configurations, etc., and nothing seems to stop this from happening. The only possible culprit I can think of is the fact that all of this is running on a single machine, but it's a veritable beast of a machine--256GB RAM, multiple CPUs, a powerful SSD, etc. The coordinating node has a 10GB heap, the data node has 30GB, and each master has 16GB.

Any insight on what may be happening here? Happy to provide any other info if it would be helpful in diagnosing this problem.

What is the output from the _cluster/stats?pretty&human API?
What version are you on?
What do your Elasticsearch logs show?

Hey Mark,
Version is 7.12.0, though this has happened consistently across the last several versions (starting at 7.10.0, which is what I first installed this cluster on). Another thing I probably should've mentioned is that each node is running in a Docker container.
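(Roughly speaking, each node runs from the official image with its heap set through ES_JAVA_OPTS--the snippet below is just a sketch with illustrative container names, not my exact run configuration:

docker run -d --name es-data1 \
  -e ES_JAVA_OPTS="-Xms30g -Xmx30g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.12.0   # discovery and node-role settings omitted for brevity

docker run -d --name es-master1 \
  -e ES_JAVA_OPTS="-Xms16g -Xmx16g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.12.0
)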

Scrolling through the logs at the time of the event yesterday, pretty much everything I'm seeing looks like this:

[elasticsearch.server][WARN] failed to retrieve shard stats from node [GPhxT1vaSJe28koqMC1lrA]: [es-data1][***][indices:monitor/stats[n]] request_id [51668292] timed out after [15005ms]

Along with corresponding messages showing eventual responses, between 25s and 15m later.

Here's the stats output now (no issue at the moment; I'll try to capture another one later today for comparison if the problem happens again):

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "my-cluster",
  "cluster_uuid" : "my-uuid",
  "timestamp" : 1622034173125,
  "status" : "green",
  "indices" : {
    "count" : 206,
    "shards" : {
      "total" : 206,
      "primaries" : 206,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 1992111813,
      "deleted" : 22353
    },
    "store" : {
      "size_in_bytes" : 507784642462,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 182552,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 648062,
      "total_count" : 29206871,
      "hit_count" : 2047801,
      "miss_count" : 27159070,
      "cache_size" : 5416,
      "cache_count" : 45240,
      "evictions" : 39824
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 5273,
      "memory_in_bytes" : 76208396,
      "terms_memory_in_bytes" : 55658840,
      "stored_fields_memory_in_bytes" : 2879816,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 7688448,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 9981292,
      "index_writer_memory_in_bytes" : 3178399772,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 250832,
      "max_unsafe_auto_id_timestamp" : 1621858669266,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 275,
          "index_count" : 130
        },
        {
          "name" : "constant_keyword",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 299,
          "index_count" : 197
        },
        {
          "name" : "date_nanos",
          "count" : 182,
          "index_count" : 182
        },
        {
          "name" : "double",
          "count" : 182,
          "index_count" : 182
        },
        {
          "name" : "float",
          "count" : 851,
          "index_count" : 182
        },
        {
          "name" : "ip",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 6019,
          "index_count" : 197
        },
        {
          "name" : "long",
          "count" : 1407,
          "index_count" : 189
        },
        {
          "name" : "nested",
          "count" : 8,
          "index_count" : 8
        },
        {
          "name" : "object",
          "count" : 368,
          "index_count" : 196
        },
        {
          "name" : "text",
          "count" : 6599,
          "index_count" : 197
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.10.2",
        "index_count" : 9,
        "primary_shard_count" : 9,
        "total_primary_bytes" : 6599500
      },
      {
        "version" : "7.11.0",
        "index_count" : 9,
        "primary_shard_count" : 9,
        "total_primary_bytes" : 76385336
      },
      {
        "version" : "7.12.0",
        "index_count" : 188,
        "primary_shard_count" : 188,
        "total_primary_bytes" : 507701657626
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "coordinating_only" : 0,
      "data" : 1,
      "data_cold" : 0,
      "data_content" : 0,
      "data_frozen" : 0,
      "data_hot" : 0,
      "data_warm" : 0,
      "ingest" : 1,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.0"
    ],
    "os" : {
      "available_processors" : 640,
      "allocated_processors" : 640,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 8",
          "count" : 5
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 5
        }
      ],
      "mem" : {
        "total_in_bytes" : 1349931417600,
        "free_in_bytes" : 21592543232,
        "used_in_bytes" : 1328338874368,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 1
      },
      "open_file_descriptors" : {
        "min" : 754,
        "max" : 8621,
        "avg" : 2348
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 4987556072,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 38936636224,
        "heap_max_in_bytes" : 94489280512
      },
      "threads" : 1367
    },
    "fs" : {
      "total_in_bytes" : 4148140335104,
      "free_in_bytes" : 3294954549248,
      "available_in_bytes" : 3084122091520
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 5
      },
      "http_types" : {
        "netty4" : 5
      }
    },
    "discovery_types" : {
      "zen" : 5
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 5
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 68,
      "processor_stats" : {
        "append" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "date" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "dot_expander" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "json" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "pipeline" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "split" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

So the issue just cropped up again. Unfortunately, the cluster stats API call times out while the issue is happening. Another interesting data point: when trying to access the Discover tab in Kibana in a fresh browser window, I get the error pictured below. However, if I refresh rows in an already-open Discover tab, it can retrieve entries just fine--it really seems like this is an issue with stats or index-name lookups rather than an overall system bottleneck. Hopefully that helps narrow it down for someone.

Another potentially interesting data point: I restarted my cluster, and it was unable to finish allocating shards, return REST calls, or provide metrics until I turned off my Logstash instances--it was, however, still indexing data into the indices that did have allocated shards. Once I turned off Logstash, it allocated all shards and began successfully responding to calls within a minute. It acts as though the CPU is pegged, but when I look at htop, almost none of the CPUs are utilized.
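(Side note for anyone else hitting this: stuck shards, and the reason they're stuck, can be checked with the standard APIs, e.g. the following--host and port are placeholders:

curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node' | grep -v STARTED   # anything not yet started
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'                                # explains one unassigned shard, errors if there are none
)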

And another interesting data point:

The output of _cat/thread_pool?v&h=id,name,queue,rejected,completed (which will eventually return while the cluster is in this goofy state, though it takes a long time) looks like this:

id                     name                                   queue rejected completed
KxB7zYzZSRyxkcknPWc5Pg analyze                                    0        0         0
KxB7zYzZSRyxkcknPWc5Pg ccr                                        0        0         0
KxB7zYzZSRyxkcknPWc5Pg fetch_shard_started                        0        0         0
KxB7zYzZSRyxkcknPWc5Pg fetch_shard_store                          0        0         0
KxB7zYzZSRyxkcknPWc5Pg flush                                      0        0         0
KxB7zYzZSRyxkcknPWc5Pg force_merge                                0        0         0
KxB7zYzZSRyxkcknPWc5Pg generic                                    0        0      7796
KxB7zYzZSRyxkcknPWc5Pg get                                        0        0         0
KxB7zYzZSRyxkcknPWc5Pg listener                                   0        0         0
KxB7zYzZSRyxkcknPWc5Pg management                                 0        0      4955
KxB7zYzZSRyxkcknPWc5Pg ml_datafeed                                0        0         0
KxB7zYzZSRyxkcknPWc5Pg ml_job_comms                               0        0         0
KxB7zYzZSRyxkcknPWc5Pg ml_utility                                 0        0       845
KxB7zYzZSRyxkcknPWc5Pg refresh                                    0        0         0
KxB7zYzZSRyxkcknPWc5Pg rollup_indexing                            0        0         0
KxB7zYzZSRyxkcknPWc5Pg search                                     0        0         0
KxB7zYzZSRyxkcknPWc5Pg search_throttled                           0        0         0
KxB7zYzZSRyxkcknPWc5Pg searchable_snapshots_cache_fetch_async     0        0         0
KxB7zYzZSRyxkcknPWc5Pg searchable_snapshots_cache_prewarming      0        0         0
KxB7zYzZSRyxkcknPWc5Pg snapshot                                   0        0         0
KxB7zYzZSRyxkcknPWc5Pg system_read                                0        0       126
KxB7zYzZSRyxkcknPWc5Pg system_write                               0        0         0
KxB7zYzZSRyxkcknPWc5Pg transform_indexing                         0        0         0
KxB7zYzZSRyxkcknPWc5Pg warmer                                     0        0         0
KxB7zYzZSRyxkcknPWc5Pg watcher                                    0        0         0
KxB7zYzZSRyxkcknPWc5Pg write                                      0        0         0
C1SP7qAUSCyFGWx8NpCd9w analyze                                    0        0         0
C1SP7qAUSCyFGWx8NpCd9w ccr                                        0        0         0
C1SP7qAUSCyFGWx8NpCd9w fetch_shard_started                        0        0         0
C1SP7qAUSCyFGWx8NpCd9w fetch_shard_store                          0        0         0
C1SP7qAUSCyFGWx8NpCd9w flush                                      0        0         0
C1SP7qAUSCyFGWx8NpCd9w force_merge                                0        0         0
C1SP7qAUSCyFGWx8NpCd9w generic                                    0        0      1894
C1SP7qAUSCyFGWx8NpCd9w get                                        0        0         0
C1SP7qAUSCyFGWx8NpCd9w listener                                   0        0         0
C1SP7qAUSCyFGWx8NpCd9w management                                 0        0      2519
C1SP7qAUSCyFGWx8NpCd9w ml_datafeed                                0        0         0
C1SP7qAUSCyFGWx8NpCd9w ml_job_comms                               0        0         0
C1SP7qAUSCyFGWx8NpCd9w ml_utility                                 0        0       841
C1SP7qAUSCyFGWx8NpCd9w refresh                                    0        0         0
C1SP7qAUSCyFGWx8NpCd9w rollup_indexing                            0        0         0
C1SP7qAUSCyFGWx8NpCd9w search                                     0        0         0
C1SP7qAUSCyFGWx8NpCd9w search_throttled                           0        0         0
C1SP7qAUSCyFGWx8NpCd9w searchable_snapshots_cache_fetch_async     0        0         0
C1SP7qAUSCyFGWx8NpCd9w searchable_snapshots_cache_prewarming      0        0         0
C1SP7qAUSCyFGWx8NpCd9w snapshot                                   0        0         0
C1SP7qAUSCyFGWx8NpCd9w system_read                                0        0         0
C1SP7qAUSCyFGWx8NpCd9w system_write                               0        0         0
C1SP7qAUSCyFGWx8NpCd9w transform_indexing                         0        0         0
C1SP7qAUSCyFGWx8NpCd9w warmer                                     0        0         0
C1SP7qAUSCyFGWx8NpCd9w watcher                                    0        0         0
C1SP7qAUSCyFGWx8NpCd9w write                                      0        0         0
Syx8kjYLTmOIgam0gsgdPA analyze                                    0        0         0
Syx8kjYLTmOIgam0gsgdPA ccr                                        0        0         0
Syx8kjYLTmOIgam0gsgdPA fetch_shard_started                        0        0         0
Syx8kjYLTmOIgam0gsgdPA fetch_shard_store                          0        0         0
Syx8kjYLTmOIgam0gsgdPA flush                                      0        0         0
Syx8kjYLTmOIgam0gsgdPA force_merge                                0        0         0
Syx8kjYLTmOIgam0gsgdPA generic                                    0        0      1907
Syx8kjYLTmOIgam0gsgdPA get                                        0        0         0
Syx8kjYLTmOIgam0gsgdPA listener                                   0        0         0
Syx8kjYLTmOIgam0gsgdPA management                                 0        0      2520
Syx8kjYLTmOIgam0gsgdPA ml_datafeed                                0        0         0
Syx8kjYLTmOIgam0gsgdPA ml_job_comms                               0        0         0
Syx8kjYLTmOIgam0gsgdPA ml_utility                                 0        0       843
Syx8kjYLTmOIgam0gsgdPA refresh                                    0        0         0
Syx8kjYLTmOIgam0gsgdPA rollup_indexing                            0        0         0
Syx8kjYLTmOIgam0gsgdPA search                                     0        0         0
Syx8kjYLTmOIgam0gsgdPA search_throttled                           0        0         0
Syx8kjYLTmOIgam0gsgdPA searchable_snapshots_cache_fetch_async     0        0         0
Syx8kjYLTmOIgam0gsgdPA searchable_snapshots_cache_prewarming      0        0         0
Syx8kjYLTmOIgam0gsgdPA snapshot                                   0        0         0
Syx8kjYLTmOIgam0gsgdPA system_read                                0        0         0
Syx8kjYLTmOIgam0gsgdPA system_write                               0        0         0
Syx8kjYLTmOIgam0gsgdPA transform_indexing                         0        0         0
Syx8kjYLTmOIgam0gsgdPA warmer                                     0        0         0
Syx8kjYLTmOIgam0gsgdPA watcher                                    0        0         0
Syx8kjYLTmOIgam0gsgdPA write                                      0        0         0
zMUxvgq5RL-K5xFP_svEFw analyze                                    0        0         0
zMUxvgq5RL-K5xFP_svEFw ccr                                        0        0         0
zMUxvgq5RL-K5xFP_svEFw fetch_shard_started                        0        0       212
zMUxvgq5RL-K5xFP_svEFw fetch_shard_store                          0        0        24
zMUxvgq5RL-K5xFP_svEFw flush                                      0        0        32
zMUxvgq5RL-K5xFP_svEFw force_merge                                0        0         0
zMUxvgq5RL-K5xFP_svEFw generic                                    0        0     23453
zMUxvgq5RL-K5xFP_svEFw get                                        0        0         0
zMUxvgq5RL-K5xFP_svEFw listener                                   0        0         0
zMUxvgq5RL-K5xFP_svEFw management                                 0        0      5875
zMUxvgq5RL-K5xFP_svEFw ml_datafeed                                0        0         0
zMUxvgq5RL-K5xFP_svEFw ml_job_comms                               0        0         0
zMUxvgq5RL-K5xFP_svEFw ml_utility                                 0        0       622
zMUxvgq5RL-K5xFP_svEFw refresh                                    0        0     24187
zMUxvgq5RL-K5xFP_svEFw rollup_indexing                            0        0         0
zMUxvgq5RL-K5xFP_svEFw search                                     0        0         0
... cutting some for post length ...
Lq-k-ysUSC-9aGDDs9nE7g transform_indexing                         0        0         0
Lq-k-ysUSC-9aGDDs9nE7g warmer                                     0        0         0
Lq-k-ysUSC-9aGDDs9nE7g watcher                                    0        0         0
Lq-k-ysUSC-9aGDDs9nE7g write                                      0        0         0
GPhxT1vaSJe28koqMC1lrA analyze                                    0        0         0
GPhxT1vaSJe28koqMC1lrA ccr                                        0        0         0
GPhxT1vaSJe28koqMC1lrA fetch_shard_started                        0        0       212
GPhxT1vaSJe28koqMC1lrA fetch_shard_store                          0        0        24
GPhxT1vaSJe28koqMC1lrA flush                                      0        0        30
GPhxT1vaSJe28koqMC1lrA force_merge                                0        0         0
GPhxT1vaSJe28koqMC1lrA generic                                    0        0     53414
GPhxT1vaSJe28koqMC1lrA get                                        0        0         0
GPhxT1vaSJe28koqMC1lrA listener                                   0        0         0
GPhxT1vaSJe28koqMC1lrA management                             14442        0     23908
GPhxT1vaSJe28koqMC1lrA ml_datafeed                                0        0         0
GPhxT1vaSJe28koqMC1lrA ml_job_comms                               0        0         0
GPhxT1vaSJe28koqMC1lrA ml_utility                                 0        0       712
GPhxT1vaSJe28koqMC1lrA refresh                                    0        0     72130
GPhxT1vaSJe28koqMC1lrA rollup_indexing                            0        0         0
GPhxT1vaSJe28koqMC1lrA search                                     0        0         0
GPhxT1vaSJe28koqMC1lrA search_throttled                           0        0         0
GPhxT1vaSJe28koqMC1lrA searchable_snapshots_cache_fetch_async     0        0         0
GPhxT1vaSJe28koqMC1lrA searchable_snapshots_cache_prewarming      0        0         0
GPhxT1vaSJe28koqMC1lrA snapshot                                   0        0         0
GPhxT1vaSJe28koqMC1lrA system_read                                0        0       177
GPhxT1vaSJe28koqMC1lrA system_write                               0        0        69
GPhxT1vaSJe28koqMC1lrA transform_indexing                         0        0         0
GPhxT1vaSJe28koqMC1lrA warmer                                     0        0       358
GPhxT1vaSJe28koqMC1lrA watcher                                    0        0         0
GPhxT1vaSJe28koqMC1lrA write                                    113        0    125151

Notably, each time I re-run the call, the management queue on the data node keeps increasing. In my case, I have my master nodes separate from my data nodes, so I'd think management tasks shouldn't get backed up, but perhaps I've configured something wrong?
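(To watch just the interesting pools without pulling the whole table, something like this works--host and port are placeholders again:

curl -s 'http://localhost:9200/_cat/thread_pool/management,write?v&h=node_name,name,queue,rejected,completed'
)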

Today I've tried tweaking memory amounts, pinning my containers to CPU ranges so I know nothing is overlapping that way, adding a second data node, and a few other tweaks, but nothing has improved the situation--when the data nodes are actively indexing, their stats calls never respond. If I turn off Logstash, all calls instantly return, so something is definitely getting starved out. Non-stats data requests still go through just fine.
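(The CPU pinning was done with the usual Docker cpuset flags, along the lines of the following--container names and core ranges are illustrative, not my exact layout:

docker update --cpuset-cpus="0-31" es-data1     # pin the data node to its own cores
docker update --cpuset-cpus="32-39" es-master1  # keep the masters on a separate range
)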

It actually sounds very similar to this issue; however, I don't see any completionStats requests coming through in our hot threads--or any management tasks at all, really (which makes sense if they're just getting outright starved).

Can you share your hot threads?

Here are a few samples, all taken about a minute apart--no tasks are shared between them (the one Lucene merge looks similar, but it's a merge for a different index):

Hot Threads #1
Hot Threads #2
Hot Threads #3
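(For reference, these are just standard hot-threads captures, i.e. something along the lines of:

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5&interval=500ms'
)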

Not seeing much in there unfortunately.

Can you upgrade? 7.12 or 7.13 would be good.

I'm currently on 7.12; I can try upgrading to 7.13 today.

Updated to 7.13 and still seem to be having the issue. It's annoying that it presents so similarly to the other thread I linked, but seems to have a different cause that doesn't show up in the hot threads =\
