High CPU usage after ES update to 7.17.7

Hello,

After an update from 7.16.2 to 7.17.7, we're experiencing about double the CPU usage without changing anything else. This happens around the clock and the usage basically never drops below the new baseline. When looking at the hot_threads on a node, I see the following:
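(The full dump is in the pastebin linked further down this thread. For reference, a dump like that comes from the nodes hot threads API; one typical invocation looks like this, though the exact parameters may vary:)

GET /_nodes/hot_threads?threads=3&interval=500ms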

Can someone please tell me what the transport_worker does and how I may be able to fix this?

Thanks in advance

The only thing we changed was to add 3 dedicated master nodes, so we now have:

3 dedicated master nodes
19 es nodes
1 coordinating node

What would be the best practice for configuring the Logstash elasticsearch output plugin? Is it best to only send requests to the coordinating node (and maybe add some more of them), or to add every data node plus the coordinating node?

What was your previous cluster layout before adding the dedicated master nodes?

Also, we cannot say directly why the CPU spike has happened.

Also, is the configuration of the newly changed cluster any different from before?

What would be the best practice for configuring the Logstash elasticsearch output plugin? Is it best to only send requests to the coordinating node (and maybe add some more of them), or to add every data node plus the coordinating node?

As for this question: I'm sure it's not necessary to have a coordinating node for every data node.

Before, basically every one of the 19 nodes was a master-eligible + data node, and we had 3 coordinating nodes.

Well, what bothers me is the fact that the spike happened as soon as we updated to 7.17.7.

No, we didn't change the configuration at all.

So now you've basically added 4 more nodes, which makes 3 + 19 + 1 = 23.

Is my understanding right?

I feel that it might well not be a version upgrade issue.

It could instead be the architecture change you've made.

It would be easier to guide you if you could share the before and after architecture along with the node configuration.

Before:

19 data nodes / master eligible nodes
3 coordinating nodes

Now:

19 data nodes
3 dedicated master nodes

We didn't add more nodes; it's just that the 3 coordinating nodes are now 3 dedicated master nodes.

Understood. In general, rather than having the coordinating role on a data node, it's better to have a dedicated coordinating node, as that is the node responsible for receiving and routing client requests.

I suggest you do that if possible.
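Coming back to your earlier Logstash question: a minimal sketch of the elasticsearch output pointed only at the coordinating node(s) could look like this (the hostname and index pattern are just placeholders):

output {
  elasticsearch {
    # send indexing/bulk requests only to the dedicated coordinating node(s);
    # hostnames are placeholders
    hosts => ["https://coordinating-node-1:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

If you later add more coordinating nodes, you can list them all in hosts and the plugin will balance requests across them.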

Just did that about 20 minutes ago, so now we have:

19 data nodes
3 dedicated master nodes
1 dedicated coordinating node

Load doesn't seem to go down, though.

I'm not sure why that is happening in your case, then.

But one way you can identify it is by enabling stack monitoring and seeing at what time the peak occurred, which process is consuming high CPU, etc.
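If you use the legacy collection method (rather than Metricbeat), one way to switch collection on is a cluster setting along these lines:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}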

What is the output from the _cluster/stats?pretty&human API?

The only other thing we did was update the Java version. Yesterday, we tried running Elasticsearch with the bundled JDK (only on one node, though), but it didn't seem to boot.

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "x",
  "cluster_uuid" : "jhVevJxxxxxqpsw-HXGTDQ",
  "timestamp" : 1671087290060,
  "status" : "green",
  "indices" : {
    "count" : 999,
    "shards" : {
      "total" : 2202,
      "primaries" : 1100,
      "replication" : 1.0018181818181817,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 19,
          "avg" : 2.204204204204204
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.1011011011011012
        },
        "replication" : {
          "min" : 0.0,
          "max" : 18.0,
          "avg" : 1.014014014014014
        }
      }
    },
    "docs" : {
      "count" : 51953031762,
      "deleted" : 257029
    },
    "store" : {
      "size" : "66.6tb",
      "size_in_bytes" : 73322893796775,
      "total_data_set_size" : "66.6tb",
      "total_data_set_size_in_bytes" : 73322893796775,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "8.5mb",
      "memory_size_in_bytes" : 8935928,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "313mb",
      "memory_size_in_bytes" : 328205495,
      "total_count" : 56055651,
      "hit_count" : 2526064,
      "miss_count" : 53529587,
      "cache_size" : 1621,
      "cache_count" : 60146,
      "evictions" : 58525
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 58458,
      "memory" : "2.6gb",
      "memory_in_bytes" : 2884731444,
      "terms_memory" : "2.1gb",
      "terms_memory_in_bytes" : 2297977536,
      "stored_fields_memory" : "190.2mb",
      "stored_fields_memory_in_bytes" : 199540880,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "180mb",
      "norms_memory_in_bytes" : 188778688,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "189.2mb",
      "doc_values_memory_in_bytes" : 198434340,
      "index_writer_memory" : "1.3gb",
      "index_writer_memory_in_bytes" : 1473972702,
      "version_map_memory" : "7.8kb",
      "version_map_memory_in_bytes" : 8074,
      "fixed_bit_set" : "2.7gb",
      "fixed_bit_set_memory_in_bytes" : 2931549952,
      "max_unsafe_auto_id_timestamp" : 1671087220430,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 240,
          "index_count" : 240,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 6879,
          "index_count" : 467,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 783,
          "index_count" : 261,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 15174,
          "index_count" : 951,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 2808,
          "index_count" : 234,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 1226,
          "index_count" : 253,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 2208,
          "index_count" : 240,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 36,
          "index_count" : 12,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 3939,
          "index_count" : 279,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 393093,
          "index_count" : 956,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 33260,
          "index_count" : 914,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 3059,
          "index_count" : 251,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 74066,
          "index_count" : 944,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 240,
          "index_count" : 240,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 65475,
          "index_count" : 956,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 9,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 3978,
          "index_count" : 234,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [
        {
          "name" : "keyword",
          "count" : 2,
          "index_count" : 2
        }
      ]
    },
    "versions" : [
      {
        "version" : "6.5.4",
        "index_count" : 3,
        "primary_shard_count" : 3,
        "total_primary_size" : "1.1mb",
        "total_primary_bytes" : 1173749
      },
      {
        "version" : "6.8.3",
        "index_count" : 1,
        "primary_shard_count" : 1,
        "total_primary_size" : "1.1mb",
        "total_primary_bytes" : 1235805
      },
      {
        "version" : "7.5.2",
        "index_count" : 4,
        "primary_shard_count" : 4,
        "total_primary_size" : "898.3kb",
        "total_primary_bytes" : 919928
      },
      {
        "version" : "7.8.1",
        "index_count" : 11,
        "primary_shard_count" : 39,
        "total_primary_size" : "143.1mb",
        "total_primary_bytes" : 150137569
      },
      {
        "version" : "7.13.4",
        "index_count" : 9,
        "primary_shard_count" : 25,
        "total_primary_size" : "22.7mb",
        "total_primary_bytes" : 23883042
      },
      {
        "version" : "7.16.1",
        "index_count" : 6,
        "primary_shard_count" : 6,
        "total_primary_size" : "7.8mb",
        "total_primary_bytes" : 8228693
      },
      {
        "version" : "7.16.2",
        "index_count" : 749,
        "primary_shard_count" : 769,
        "total_primary_size" : "25.5tb",
        "total_primary_bytes" : 28124589671772
      },
      {
        "version" : "7.17.7",
        "index_count" : 216,
        "primary_shard_count" : 253,
        "total_primary_size" : "7.7tb",
        "total_primary_bytes" : 8473670439298
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "coordinating_only" : 1,
      "data" : 19,
      "data_cold" : 0,
      "data_content" : 0,
      "data_frozen" : 0,
      "data_hot" : 0,
      "data_warm" : 0,
      "ingest" : 0,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.7"
    ],
    "os" : {
      "available_processors" : 256,
      "allocated_processors" : 256,
      "names" : [
        {
          "name" : "Linux",
          "count" : 23
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "SUSE Linux Enterprise Server 15 SP3",
          "count" : 23
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 23
        }
      ],
      "mem" : {
        "total" : "1.1tb",
        "total_in_bytes" : 1282578382848,
        "free" : "118.1gb",
        "free_in_bytes" : 126916104192,
        "used" : "1tb",
        "used_in_bytes" : 1155662278656,
        "free_percent" : 10,
        "used_percent" : 90
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 606
      },
      "open_file_descriptors" : {
        "min" : 969,
        "max" : 3878,
        "avg" : 3181
      }
    },
    "jvm" : {
      "max_uptime" : "7.9d",
      "max_uptime_in_millis" : 685940030,
      "versions" : [
        {
          "version" : "11.0.16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "11.0.16+8-suse-150000.3.83.1-x8664",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 23
        }
      ],
      "mem" : {
        "heap_used" : "204.9gb",
        "heap_used_in_bytes" : 220113856776,
        "heap_max" : "486.2gb",
        "heap_max_in_bytes" : 522110697472
      },
      "threads" : 3223
    },
    "fs" : {
      "total" : "103.6tb",
      "total_in_bytes" : 113925307760640,
      "free" : "36.7tb",
      "free_in_bytes" : 40458619760640,
      "available" : "36.7tb",
      "available_in_bytes" : 40458619760640
    },
    "plugins" : [
      {
        "name" : "search-guard-7",
        "version" : "7.17.7-53.5.0",
        "elasticsearch_version" : "7.17.7",
        "java_version" : "1.8",
        "description" : "Provide access control related features for Elasticsearch 7",
        "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
        "extended_plugins" : [
          "lang-painless"
        ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 23
      },
      "http_types" : {
        "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 23
      }
    },
    "discovery_types" : {
      "zen" : 23
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 23
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 3,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

I was not able to access the pastebin you posted a link to. Can you upload it again and provide a new link?

Without seeing the full hot threads output it is difficult to be precise, but the transport_worker threads handle communication to and between the nodes. An important part of that is securing communication across those connections, which means the issue could lie with your use of the SearchGuard plugin, which is not supported here. I would therefore recommend switching to the built-in security and seeing if that changes anything, or reaching out to the SearchGuard community.
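If you do try the built-in security, the transport-layer TLS is configured per node in elasticsearch.yml; a rough sketch of the settings involved (the certificate paths and verification mode below are placeholders, and the SearchGuard settings would have to be removed first):

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12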

Sure.

https://pastebin.com/raw/zV8zzy0n

There are a lot of SearchGuard-related lines in the hot threads, so I suspect this issue is likely related to SearchGuard.

This does reinforce my recommendation above.


Thank you. I contacted SearchGuard and will update this thread as soon as I have a solution, as it might concern others as well.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.