System unresponsive with auditbeat running

ppanagiotis · August 3, 2018, 9:36am

I've setup Auditbeat to report to an elasticsearch cluster. If the elasticsearch index state is RED auditbeat occasionally makes the whole system unresponsive for 1 minute.

logs:

2018-08-03T12:29:45.951+0300 DEBUG [elasticsearch] elasticsearch/client.go:507 Bulk item insert failed (i=49, status=503): {"type":"unavailable_shards_exception","reason":"[auditd-2018.08.03][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[auditd-2018.08.03][1]] containing [23] requests]"}

kvch · August 3, 2018, 9:41am

Could you share your configuration formatted using </>? Also, if you could add more debug logs it would be useful.

ppanagiotis · August 3, 2018, 9:52am

Config file:


  # Glob pattern for configuration reloading
  path: ${path.config}/conf.*.d/*.yml

  # Period on which files under path should be checked for changes
  reload.period: 10s

  # Set to true to enable config reloading
  reload.enabled: enable

setup.template.settings:
  index.number_of_shards: 2

output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["xxx1", "xxx2"]
  index: "auditd-%{+yyyy.MM.dd}"

setup.template.name: "auditd"
setup.template.pattern: "auditd-*"

And some more logs:

2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer 2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:151 done 2018-08-03T12:41:59.099+0300 ERROR pipeline/output.go:92 Failed to publish events: temporary bulk send failure 2018-08-03T12:41:59.099+0300 DEBUG [elasticsearch] elasticsearch/client.go:666 ES Ping(url=http://xxx1:9200) 2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer 2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:174 done 2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer 2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:151 done 2018-08-03T12:41:59.100+0300 DEBUG [elasticsearch] elasticsearch/client.go:689 Ping status code: 200 2018-08-03T12:41:59.101+0300 INFO elasticsearch/client.go:690 Connected to Elasticsearch version 5.4.2 2018-08-03T12:41:59.101+0300 DEBUG [elasticsearch] elasticsearch/client.go:708 HEAD http://xxx1:9200/_template/auditd <nil> 2018-08-03T12:41:59.103+0300 INFO template/load.go:73 Template already exists and will not be overwritten. 2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer 2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:174 done 2018-08-03T12:42:00.815+0300 INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":930,"time":{"ms":4}},"total":{"ticks":6450,"time":{"ms":76},"value":6450},"user":{"ticks":5520,"time":{"ms":72}}},"info":{"ephemeral_id":"3e9667e7-a869-4ba6-9125-5840be3f01ff","uptime":{"ms":3360011}},"memstats":{"gc_next":82140992,"memory_alloc":41484232,"memory_total":395598440}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"batches":2,"failed":100,"total":100},"read":{"bytes":2361},"write":{"bytes":147983}},"pipeline":{"clients":2,"events":{"active":4117,"retry":200}}},"system":{"load":{"1":0.01,"15":2.08,"5":0.72,"norm":{"1":0.005,"15":1.04,"5":0.36}}}}}}

adrisr · August 3, 2018, 10:06am

You are experiencing the backpressure problem: Auditbeat stops reading audit events from the kernel until it can resume delivering them to Elasticsearch. This causes the kernel's internal queue to fill and causes audited processes to block, making the system unresponsive.

Starting with Auditbeat 6.3.0, new strategies to deal with this problem are implemented, see the backpressure_strategy option:
https://www.elastic.co/guide/en/beats/auditbeat/master/auditbeat-module-auditd.html#_configuration_options_13

There's no need to configure anything. By default, Auditbeat 6.3+ will configure the kernel to not block audited processes when messages cannot be delivered, so the system will never become unresponsive, but audit messages can be lost.

ppanagiotis · August 3, 2018, 10:35am

Thanks for your help.
I have

apt-cache policy auditbeat 
auditbeat:
  Installed: 6.3.2
  Candidate: 6.3.2

also kernel version is:

uname -a
Linux icinga 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

I tried with backpressure_strategy auto and kernel but the problem still exists.

Dummy script for tests:

while true ; do { time cat /var/log/auditbeat/auditbeat >/dev/null; sleep 1 ;} 2>&1 | grep -v 0m0 ; done

output:
real 1m0.467s

EDITED:

I check the source and saw that backpresure setting introduced at 124c8a28f9 which is after 6.3.2 version. I will try to build the debian package with the latest source code and see if it works as expected.

adrisr · August 3, 2018, 12:00pm

You're completely right, I got confused with the dates. My bad.

The feature will be available in the next release, 6.4.0.

It's great if you can test the feature now. Please update with the results or ask if you have any problems building the packages.

ppanagiotis · August 3, 2018, 1:04pm

I build the package with:
make snapshot
And test it. It seems to work as expected. Thank you very much for your support.
Any plans for releasing 6.4.0 version?

andrewkroh · August 3, 2018, 2:54pm

It will likely be released later this month.

Topic		Replies	Views
Auditbeat 7.8.0 stuck: process running but no data sent to Elasticsearch Beats auditbeat	1	344	December 11, 2020
Bulk item insert failed Elasticsearch	3	2709	January 4, 2017
Auditbeat not work in Ubuntu 22.04 after a reboot or shutdown Beats auditbeat	6	17	November 1, 2024
Metricbeat Index turns red every hours Elasticsearch	6	1017	September 29, 2017
Cluster red, unassigned shards, no response on writes Elasticsearch	17	691	February 22, 2023

System unresponsive with auditbeat running

Related topics