Metricbeat timeout

Hi guys,
I have a cluster of 30 computers with 25 virtual machines.
Every virtual machine has Metricbeat and Filebeat installed; they send metrics and logs to a physical Elasticsearch server.

Randomly I get timeout errors in the Metricbeat logs:

2017-02-03T18:06:02+01:00 INFO Non-zero metrics in the last 30s: fetches.system-filesystem.events=42 fetches.system-memory.success=2 fetches.system-network.success=3 fetches.system-socket.events=401 fetches.system-filesystem.success=2 fetches.system-fsstat.success=2 fetches.system-load.success=2 libbeat.logstash.published_and_acked_events=7 libbeat.publisher.published_events=834 libbeat.logstash.publish.write_bytes=285 fetches.system-fsstat.events=2 fetches.system-memory.events=2 fetches.system-socket.success=3 fetches.system-load.events=2 fetches.system-process.success=3 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=6 fetches.system-network.events=6 fetches.system-process.events=368 fetches.system-cpu.events=2 fetches.system-cpu.success=2 libbeat.publisher.messages_in_worker_queues=835
2017-02-03T18:06:12+01:00 ERR Failed to publish events caused by: read tcp 192.168.3.44:43498->172.10.1.250:5044: i/o timeout
2017-02-03T18:06:12+01:00 INFO Error publishing events (retrying): read tcp 192.168.3.44:43498->172.10.1.250:5044: i/o timeout
2017-02-03T18:06:32+01:00 INFO Non-zero metrics in the last 30s: fetches.system-network.success=3 fetches.system-process.events=284 libbeat.publisher.messages_in_worker_queues=532 fetches.system-filesystem.success=3 fetches.system-filesystem.events=45 fetches.system-process.success=3 libbeat.outputs.messages_dropped=1 fetches.system-cpu.success=3 fetches.system-fsstat.events=3 fetches.system-fsstat.success=3 fetches.system-cpu.events=3 libbeat.logstash.call_count.PublishEvents=1 fetches.system-socket.events=185 libbeat.logstash.publish.write_bytes=2737 fetches.system-socket.success=3 libbeat.logstash.publish.read_errors=1 libbeat.publisher.published_events=532 fetches.system-load.events=3 fetches.system-load.success=3 fetches.system-memory.events=3 fetches.system-memory.success=3 fetches.system-network.events=6 libbeat.logstash.published_but_not_acked_events=1
2017-02-03T18:07:02+01:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.success=3 fetches.system-network.success=3 libbeat.publisher.messages_in_worker_queues=292 fetches.system-load.events=3 fetches.system-cpu.events=3 fetches.system-memory.events=3 fetches.system-memory.success=3 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=7439 libbeat.publisher.published_events=293 fetches.system-filesystem.events=31 fetches.system-load.success=3 fetches.system-process.success=3 fetches.system-socket.success=2 fetches.system-filesystem.success=3 fetches.system-fsstat.events=3 fetches.system-network.events=5 fetches.system-process.events=193 fetches.system-fsstat.success=3 fetches.system-socket.events=61

Usually, if I restart the Metricbeat service, everything works fine until the next timeout.

My version of Elasticsearch and Metricbeat is 5.2.

This is the Metricbeat configuration file:

###################### Metricbeat Configuration Example #######################

#==========================  Modules configuration ============================
metricbeat.modules:

#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    # CPU stats
    - cpu

    # System Load stats
    - load

    # Per CPU core stats
    #- core

    # IO stats
    #- diskio

    # Per filesystem stats
    - filesystem

    # File system summary stats
    - fsstat

    # Memory stats
    - memory

    # Network stats
    - network

    # Per process stats
    - process

    - socket
  enabled: true
  period: 10s
  processes: ['.*']



#================================ General =====================================


# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

#================================ Outputs =====================================

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.

#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch:
  # Array of hosts to connect to.
  #hosts: ["localhost:9200"]

  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["172.10.1.250:5044"]

  max_retries: 0

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

#================================ Logging =====================================


# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]

I can connect to Logstash at 172.10.1.250:5044 without problems.

The ping stats to the Elasticsearch server are fine:

PING 172.10.1.250 (172.10.1.250) 56(84) bytes of data.
64 bytes from 172.10.1.250: icmp_seq=1 ttl=63 time=0.664 ms
64 bytes from 172.10.1.250: icmp_seq=2 ttl=63 time=0.551 ms
64 bytes from 172.10.1.250: icmp_seq=3 ttl=63 time=0.651 ms
64 bytes from 172.10.1.250: icmp_seq=4 ttl=63 time=0.611 ms
64 bytes from 172.10.1.250: icmp_seq=5 ttl=63 time=0.604 ms

--- 172.10.1.250 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3997ms
rtt min/avg/max/mdev = 0.551/0.616/0.664/0.042 ms

Thanks

You're sending metrics to Logstash, not Elasticsearch. Is there a reason for doing so?

Do you also send logs to Logstash?

Any logs in Logstash? Any 'expensive' grok filters in Logstash?

Can you try increasing output.logstash.timeout to something like 5 minutes and see if the problem persists?

@steffens Thank you for your answer.
I'm sending logs to Logstash just because I was following the installation instructions (https://www.elastic.co/guide/en/beats/metricbeat/current/config-metricbeat-logstash.html) and I read:

If you want to use Logstash to perform additional processing on the data collected by Metricbeat, you need to configure Metricbeat to use Logstash.

So I thought I would use Logstash directly, to be able to add some processing in the future if I need it.

Is it better to send logs to Elasticsearch directly?

I checked the Logstash logs:

    [2017-02-06T08:00:53,789][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>#<URI::HTTP:0x475a26b9 URL:http://localhost:9200/>, :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Too many open files"}
    [2017-02-06T08:00:43,784][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
    [2017-02-06T08:00:43,785][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>#<URI::HTTP:0x475a26b9 URL:http://localhost:9200/>, :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Too many open files"}
    [2017-02-06T08:00:48,786][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
    [2017-02-06T08:00:48,788][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>#<URI::HTTP:0x475a26b9 URL:http://localhost:9200/>, :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Too many open files"}

Is there a limit on how many metrics/logs I can send to Logstash?
In /etc/init.d/elasticsearch I saw:

# Maximum number of open files
MAX_OPEN_FILES=65536
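
(Note: the "Too many open files" above appears to be raised inside the Logstash process itself, so the Logstash limit matters as well as the Elasticsearch one. A sketch for checking it, assuming the standard 5.x packages; the paths and LS_OPEN_FILES variable are assumptions based on the stock packaging:

# Limit that actually applies to the running Logstash process
cat /proc/$(pgrep -f logstash | head -n1)/limits | grep 'open files'

# For the Logstash 5.x package, the limit is set via LS_OPEN_FILES
# in /etc/logstash/startup.options (default 16384) -- adjust for
# your install.
)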

Thank you

I've updated the ulimit configuration:

GET _nodes?filter_path=**.mlockall
result: "mlockall": true

GET _nodes/stats/process?filter_path=**.max_file_descriptors
result: "max_file_descriptors": 65536

It is working a little better, but I still have some sporadic timeouts.

Now the Logstash logs are fine.

Also, on some nodes I can see this in the Metricbeat log:

2017-02-06T16:25:20+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=18 libbeat.logstash.publish.write_bytes=1778
2017-02-06T16:25:50+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=1176
2017-02-06T16:26:20+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.logstash.publish.write_bytes=2739
2017-02-06T16:26:50+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=18 libbeat.logstash.publish.write_bytes=2334
2017-02-06T16:27:20+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=18 libbeat.logstash.publish.write_bytes=2797
2017-02-06T16:27:50+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=3604 libbeat.logstash.publish.read_bytes=24
2017-02-06T16:28:20+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1081 libbeat.logstash.publish.read_bytes=12
2017-02-06T16:28:50+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.logstash.publish.write_bytes=2632
2017-02-06T16:29:20+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1229 libbeat.logstash.publish.read_bytes=12
2017-02-06T16:29:50+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=18 libbeat.logstash.publish.write_bytes=1831

If I restart the Metricbeat service, it logs the metrics for 10-15 minutes, and then I have to restart the service again.

What could it be?

Hm... can Logstash/Elasticsearch cope with the amount of data Metricbeat is sending? How many connections to Elasticsearch does Logstash maintain? How many Beats->Logstash connections do you have? How many Metricbeat instances do you have running?

What kind of processing do you have in mind? Using an Elasticsearch ingest node, you can also do processing. Do you also want to use Logstash to 'hide' Elasticsearch from the outside (that is, how do you secure your ES instance)?
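
For instance, a minimal ingest-node sketch (the pipeline name and field here are made up for illustration; ES 5.x ingest API):

curl -XPUT 'localhost:9200/_ingest/pipeline/beats-env' -d '{
  "description": "example: tag incoming events",
  "processors": [
    { "set": { "field": "fields.env", "value": "staging" } }
  ]
}'

Metricbeat's elasticsearch output can then reference it via its pipeline setting.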

This is my environment:

  1. A physical machine with Ubuntu 14 LTS on which I installed Elasticsearch and Logstash.

  2. 30 physical machines with Ubuntu 14 LTS on which I installed Metricbeat. Each sends its metrics to the Logstash instance installed on the first machine.

  3. 20 virtual machines with Ubuntu 14 LTS on which I installed Filebeat (monitoring /var/log) and Metricbeat. Both send their logs/metrics to the Logstash instance installed on the first machine.

So I suppose there are at least 70 TCP connections sending logs/metrics to one Logstash instance.

I don't have any processing in mind yet. Right now I'm using Logstash just to hide Elasticsearch from the outside.

For any other info just ask.

Thank you

Can you check the number of connections? How come you had to increase the number of file descriptors if you "just" have 70 Beats running?
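
For example (illustrative commands, run on the Logstash host):

# Established connections on the Beats input port
netstat -tan | grep ':5044' | grep -c ESTABLISHED

# Same connections broken down by source host
netstat -tan | grep ':5044' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c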

Did you check the Logstash queue metrics (incoming vs. outgoing events) and CPU/memory usage (e.g., did you consider increasing the number of workers or adding more Logstash instances)?

You can monitor Logstash either via the monitoring API or via X-Pack monitoring (requires a free basic license, but it is much more convenient than using the API).
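
The monitoring API is served on port 9600 by default; for example, on LS 5.x:

# Pipeline event counts (events in vs. out) and per-plugin timings
curl -s 'localhost:9600/_node/stats/pipeline?pretty'

# Process stats, including open file descriptors and CPU
curl -s 'localhost:9600/_node/stats/process?pretty'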

I installed X-Pack.

I logged in to one of the nodes and I still see the timeout:

2017-02-07T11:42:43+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=18 libbeat.logstash.publish.write_bytes=5923
2017-02-07T11:43:13+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=4600
2017-02-07T11:43:43+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=3899
2017-02-07T11:44:13+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=20515
2017-02-07T11:44:43+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=17123 libbeat.logstash.publish.read_bytes=6
2017-02-07T11:45:02+01:00 ERR Failed to publish events caused by: read tcp 192.168.3.41:52362->172.10.1.250:5044: i/o timeout
2017-02-07T11:45:02+01:00 INFO Error publishing events (retrying): read tcp 192.168.3.41:52362->172.10.1.250:5044: i/o timeout
2017-02-07T11:45:13+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=23181 libbeat.logstash.published_but_not_acked_events=981 libbeat.logstash.publish.read_bytes=6 libbeat.logstash.published_and_acked_events=466
2017-02-07T11:45:43+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=12 libbeat.logstash.publish.write_bytes=31813
2017-02-07T11:46:13+01:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=6 libbeat.logstash.publish.write_bytes=6429
2017-02-07T11:46:43+01:00 INFO Non-zero metrics in the last 30s: fetches.system-fsstat.success=1 fetches.system-load.success=1 fetches.system-process.events=1 libbeat.publisher.messages_in_worker_queues=4 libbeat.logstash.call_count.PublishEvents=3 fetches.system-cpu.events=1 fetches.system-cpu.success=1 libbeat.logstash.publish.read_bytes=18 fetches.system-load.events=1 libbeat.publisher.published_events=4 libbeat.logstash.publish.write_bytes=1259 libbeat.logstash.published_and_acked_events=988 fetches.system-fsstat.events=1

Also, if you need it, this is the Logstash output at startup:

[2017-02-07T11:56:22,151][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://logstash_system:xxxxxx@localhost:9200/_xpack/monitoring/?system_id=logstash&system_api_version=2&interval=1s]}}
[2017-02-07T11:56:22,155][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://logstash_system:xxxxxx@localhost:9200, :path=>"/"}
[2017-02-07T11:56:22,336][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<URI::HTTP:0x713171fb URL:http://logstash_system:xxxxxx@localhost:9200>}
[2017-02-07T11:56:22,338][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>[#<URI::HTTP:0x21e1784c URL:http://localhost:9200>]}
[2017-02-07T11:56:22,338][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>".monitoring-logstash", "pipeline.workers"=>1, "pipeline.batch.size"=>2, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>2}
[2017-02-07T11:56:22,339][INFO ][logstash.pipeline        ] Pipeline .monitoring-logstash started
[2017-02-07T11:56:23,140][INFO ][logstash.inputs.beats    ] Beats inputs: Starting input listener {:address=>"0.0.0.0:5044"}
[2017-02-07T11:56:23,184][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2017-02-07T11:56:23,185][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2017-02-07T11:56:23,200][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<URI::HTTP:0x5d4a48a8 URL:http://localhost:9200/>}
[2017-02-07T11:56:23,208][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>[#<URI::Generic:0x7a6b7f38 URL://localhost:9200>]}
[2017-02-07T11:56:23,209][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>500}
[2017-02-07T11:56:23,219][INFO ][logstash.pipeline        ] Pipeline main started
[2017-02-07T11:56:23,282][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}

Input vs. output looks OK-ish, at least for the interval monitored.

The Logstash logs come after the Beats logs. Did you restart Logstash?

Have you checked active connections via netstat?

Did you increase the output.logstash.timeout setting to 5 minutes as asked?

Does the error persist if you don't restart Metricbeat? Normally Metricbeat will close the connection, reconnect after some waiting time, and retry with a subset of the previously sent events.

Yes, I restarted Logstash.
I'm going to check the connections with netstat.

Sorry, I missed the output.logstash.timeout configuration. I'll set it.
I cannot find where to set this configuration: in logstash.yml or in metricbeat.yml?

If I don't restart Metricbeat, the error appears every few minutes.

Thanks

The timeout setting must be set in metricbeat.yml, in the output.logstash section.
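
For example, a minimal sketch of that section (per the Beats 5.x docs, the value is a number of seconds):

output.logstash:
  hosts: ["172.10.1.250:5044"]
  # Wait up to 5 minutes for Logstash to respond before timing out
  # (the default is 30 seconds).
  timeout: 300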

@steffens Thank you for your time. I really appreciate it.

With the timeout increased, I haven't seen any other timeout errors.
There is something that isn't clear to me:
If you look at the following picture, you can see that the metrics for the specific node called "ellis-1" indexed by Elasticsearch are discontinuous. I checked the log files and there are no errors.

Do you have a possible explanation for this behaviour?

Any timeouts since then?

Which logstash-input-beats plugin version is installed? I've seen a user resolve the timeout issues by downgrading to 3.1.10. Curious whether downgrading really helps.

Metricbeat has a rather small default queue size of 1000 events. If events from this queue cannot be delivered to Logstash in time (e.g., on connection errors), the queue might fill up, so no new data will be collected until the queue is drained. Plus, by default the publisher drops events after 3 retries. That is, the timeouts, or Logstash/ES blocking (and generating backpressure), can lead to data going missing.
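
For illustration, a sketch of the relevant metricbeat.yml knobs (option names as in the Beats 5.x docs; the values are examples, not recommendations):

# Internal event queue; the 5.x default is 1000 events.
queue_size: 5000

output.logstash:
  hosts: ["172.10.1.250:5044"]
  # Default is 3 retries; a value below 0 retries until all events
  # are published, while 0 (as in the config above) means events are
  # dropped after the first failed attempt.
  max_retries: -1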

We're considering adding a disk-backed output queue that could hold many more events in case of an unresponsive output.

No timeout since then.

Metricbeat: Version: 5.2.0
logstash-input-beats (3.1.12)

I checked the logs of Elasticsearch, Logstash, and Metricbeat, and there are no errors.

We are evaluating Elasticsearch for an enterprise software product, and I really need all the metrics collected by Metricbeat on each node to be indexed in Elasticsearch continuously and within a reasonable time.
I will have to create a script that processes the cluster's metrics every 5 minutes ... but if the metrics are not synchronized within those 5 minutes, it means I cannot use Elasticsearch to collect metrics.
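
A sketch of such a check against the default metricbeat-* indices (field names assume the stock 5.x Metricbeat template):

curl -XGET 'localhost:9200/metricbeat-*/_search?pretty' -d '{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-5m" } } },
  "aggs": {
    "hosts_reporting": { "terms": { "field": "beat.hostname", "size": 100 } }
  }
}'

Any host missing from the hosts_reporting buckets has not shipped metrics in the last 5 minutes.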

Thanks

@faustf In the Metricbeat logs you get a summary of the metrics every 30s. Do you still see not_acked events there?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.