After updating Logstash, errors started - Failed to perform request

The problem is with the operation of Logstash.
On the old version, 8.9, there were no problems with the pipelines, but as soon as one node was updated to 8.10 I started getting errors:

[logstash.outputs.elasticsearch][aws-pipe] Failed to perform request {:message=>"Connection pool shut down", :exception=>Manticore::ClientStoppedException, :cause=>#<Java::JavaLang::IllegalStateException: Connection pool shut down>}
Apr 18 11:39:53 logstash.my logstash[587165]: [2024-04-18T11:39:53,789][WARN ][logstash.outputs.elasticsearch][aws-pipe] Attempted to resurrect connection to dead ES instance, but got an error {:url=>"https://logstash_int:xxxxxx@ingest.my:9200/", :exception=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :message=>"Elasticsearch Unreachable: [https://ingest.my:9200/][Manticore::ClientStoppedException] Connection pool shut down"}

When I enable a new pipeline, similar errors start appearing; when I disable it, they stop. It does not seem to matter which pipeline it is - it is the same configuration that worked well before. Now I'm very worried and afraid to update Logstash to the new version on the other node, in case similar problems start there too.

The only thing I have noticed on the ingest nodes, which receive all the data from Logstash, is a very high load on the network interface.

I will be grateful for any ideas.

These errors are connection errors; they mean that Logstash is having problems connecting to Elasticsearch.

This is exactly what this message says:

Elasticsearch Unreachable: [https://ingest.my:9200]

It is unrelated to the version. Are you sure you weren't having this same issue with the previous version? Can you check your old logs?
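As a quick sanity check, you could run a curl from the Logstash host against the same endpoint that appears in your error message (the user and host below are just the ones from your log, adjust as needed):

 curl -u logstash_int:'xxxxxx' "https://ingest.my:9200/"

If that call also hangs or fails, the problem is on the network or Elasticsearch side rather than in Logstash itself.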

Did you roll back to version 8.9?

No, I moved part of the configuration to the Logstash node that is still on the old version.
Now I'll try to analyze the logs as a whole; maybe this is a cumulative problem.

While doing that, I noticed this warning:

[2024-04-18T09:58:52,770][WARN ][o.e.c.r.a.a.DesiredBalanceReconciler] [master02.my] [10.5%] of assigned shards (198/1873) are not on their desired nodes, which exceeds the warn threshold of [10%]

and this in the Logstash logs:

Apr 18 13:45:51 logstash.my logstash[602171]: [2024-04-18T13:45:51,068][ERROR][logstash.outputs.elasticsearchmonitoring][.monitoring-logstash][8f3ca24476f388a49f1e24ad046fefc77d19161f2378037b837f932f900ed390] Encountered a retryable error (will retry with exponential backoff) {:code=>503, :url=>"https://ingest01.my:9200/_monitoring/bulk?system_id=logstash&system_api_version=7&interval=1s", :content_length=>95272, :body=>"{\"error\":{\"root_cause\":[{\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/2/no master];\"}],\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/2/no master];\"},\"status\":503}"}

It seems that you have an Elasticsearch issue, not a Logstash one.

SERVICE_UNAVAILABLE/2/no master

This means that there is no master active in your cluster, so your cluster is not working right.
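You can also ask the cluster directly which node it currently considers the elected master, for example (the host and credentials below are only an illustration, use one of your own nodes):

 curl -u elastic:'xxxxxx' "https://ingest.my:9200/_cat/master?v"
 curl -u elastic:'xxxxxx' "https://ingest.my:9200/_cat/nodes?v&h=name,node.role,master"

If the first call errors out or comes back empty while the cluster block is in place, that confirms there was no elected master at that moment; in the second call the elected master is marked with a * in the master column.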

The master is shown in the web interface, but in the Logstash logs I still see messages like this:

[ERROR][logstash.licensechecker.licensereader] Unable to retrieve license information from license server {:message=>"No Available connections"}

and

[ERROR][logstash.licensechecker.licensereader] Unable to retrieve Elasticsearch cluster info. {:message=>"No Available connections", :exception=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError}

and those previous messages.

Because of this, I am afraid that after the update everything will stop working for me, since I don't see any similar errors on the second, duplicate node running the old version...

I'm still trying to figure out what is going on with the master, but so far without success.

Not sure what you mean by this. Your Logstash errors are pretty much on point: they mean that Logstash cannot connect to your Elasticsearch cluster. For Logstash to work, you first need to solve your Elasticsearch issue.

Your Elasticsearch errors indicate that your cluster has not elected a master yet.

[SERVICE_UNAVAILABLE/2/no master]

So, from what you shared it seems that you are having issues with your Elasticsearch cluster.

What is the result when you run a curl against it? For example: curl https://your-cluster:9200

I mean that when I check the status of the cluster, I can see which of the master-eligible nodes has been elected as the active master.

When I make a request, I get this response:

 curl -X GET "https://ingest02.my:9200/_cluster/health?pretty" -u elastic:'xxxxxx'
{
  "cluster_name" : "my-elk-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 1111,
  "active_shards" : 1602,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0

My data goes to the ingest nodes and then on to the data nodes. And I have noticed that the ingest nodes are losing contact with the Elasticsearch cluster, but I can't understand why this is happening... :frowning:

On the master I see the following messages:

[2024-04-25T09:26:47,511][INFO ][o.e.c.c.NodeJoinExecutor ] [master02.my] node-join: [{ingest01.my}{pZK4-zUxTzGbfKIcLFVkwg}{Lggd9-dUSRCEVA_gJgcBJQ}{ingest01.my}{10.1.6.13}{10.16.6.13:9301}{it}{8.12.2}{7000099-8500010}] with reason [rejoining]
[2024-04-25T09:27:55,822][INFO ][o.e.t.TcpTransport       ] [master02.my] close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.1.5.211:9301, remoteAddress=/10.1.6.13:55162, profile=default}], disconnecting from relevant node: Connection reset

The following parameter is configured on the NGFW:
tcp session timeout: 3600 sec

Could this be causing the problems? From what I have read, the cluster nodes need to keep their connections open all the time.
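If the NGFW really is dropping idle transport connections, one thing I am considering is lowering the kernel TCP keepalive settings on the Elasticsearch hosts so that probes are sent well before the firewall's 3600-second timeout (the values below are just an example; the Linux default tcp_keepalive_time is usually 7200 seconds):

 sudo sysctl -w net.ipv4.tcp_keepalive_time=600
 sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
 sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

and then persisting the same values in /etc/sysctl.conf if it turns out to help.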

I did a little research in the available logs and came to the conclusion that the problem is with the ingest nodes: they are the ones losing their connection to the master. The most interesting thing is that it does not happen to all of them at the same time but one by one - first one loses its connection, and then after some time another ingest node cannot reach the master, while the other Elasticsearch nodes keep their master and continue to work as normal.
Does anyone have any ideas on why this might be happening? I changed the TCP session timeout on all nodes to 3500 seconds, but this did not bring any results; data from Logstash still periodically cannot get into Elasticsearch.
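In the meantime I am trying to catch the moment a node drops out by comparing the node list as seen from an ingest node and from the master (hostnames are taken from my logs above, and I am assuming HTTP is on port 9200 on the masters as well):

 curl -u elastic:'xxxxxx' "https://ingest01.my:9200/_cat/nodes?v&h=name,node.role,master"
 curl -u elastic:'xxxxxx' "https://master02.my:9200/_cat/nodes?v&h=name,node.role,master"

If the ingest node shows fewer nodes than the master at the same moment, that should at least confirm which side is dropping the connection.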
I will be glad for any ideas; my own have already run out :frowning: