Slow ingestion in Logstash 5.3.0

On our 5.3.0 ES cluster, on the 27th of October we had a problem caused by a corrupt curator binary on the node dedicated to Kibana and curator.
We did not realize until the 29th that we were no longer ingesting data because we had run low on disk space, since curator was not being executed to clean up old logs.
We've fixed that, but we still get an abnormal number of errors and data is taking a lot longer to be ingested.
We've added more nodes of all kinds, bar master.
Our pre-27th setup was as follows:
6x data/ingest nodes w/ 8 vCPUs and 56 GB RAM
4x logstash nodes w/ 4 vCPUs and 14 GB RAM
These sit behind a load balancer, with 2 nodes serving one production environment and the other 2 serving a different production environment
2x master nodes w/ 4 vCPUs and 7 GB RAM
2x client nodes w/ 4 vCPUs and 7 GB RAM

We have since added 2 more logstash nodes (1 per pool), 2 more data nodes, and 2 more client nodes. We temporarily assigned only some indices to the new data nodes, but didn't see any improvement in how quickly those indices ingested data.
We've also, as of today, increased the size of our logstash nodes and their JVM heap (Xms/Xmx) values, which seems to have reduced the number of errors, but it's too early to tell for sure.
We set Xms/Xmx to half the available memory on each box, except on the data nodes, where we use 32g.
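
For reference, those heap settings amount to lines like the following in each node's config/jvm.options (the values here are just an example for a 14 GB logstash box):

-Xms7g
-Xmx7g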

Comparing the document counts in the indices for the 27th through the 31st, sampled on the 1st and again today (the 7th), we can see that data does eventually get ingested, but at a much slower pace than before this problem occurred.
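
We've been pulling those counts with a _cat/indices call along these lines (the hostname and index pattern are placeholders for ours):

curl -s 'http://es-client:9200/_cat/indices/logstash-2018.10.*?v&h=index,docs.count,store.size&s=index'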

We keep getting these in our logstash error logs:

[2018-11-07T12:59:47,728][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2018-11-07T12:59:47,728][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
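
If it helps, the kind of check we can run from a logstash node to see whether ES is reachable and whether bulk requests are being rejected would be something like this (hostname is a placeholder):

curl -s 'http://es-client:9200/_cluster/health?pretty'
curl -s 'http://es-client:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected,completed'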

On the machines running filebeat, we keep getting these:

2018-11-01T14:08:16Z ERR Failed to publish events caused by: write tcp 1.1.1.1:49717->2.2.2.2:5044: wsasend: An existing connection was forcibly closed by the remote host.
2018-11-01T14:08:16Z INFO Error publishing events (retrying): write tcp 1.1.1.1:49717->2.2.2.2:5044: wsasend: An existing connection was forcibly closed by the remote host.

(IP addresses sanitized)
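
For context, filebeat on these machines ships to logstash on port 5044; the relevant part of filebeat.yml boils down to something like this (the hostname is a placeholder):

output.logstash:
  hosts: ["logstash-lb:5044"]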

Oddly enough, the machines in the 2nd production environment, which run Linux, seem to be ingesting data at the expected rate...

We ingest between 100 and 200 million documents per day: roughly 50m come from the 2nd (Linux) environment, from a dozen or so machines, and the remaining ~150m come from around 200 machines running Windows Server.

What can we check/try to fix this? I understand more data may be needed but we're willing to provide that, within reason. :slight_smile:

Thanks in advance for any and all suggestions/hints/pointers/whatever.
