I've set up Auditbeat to report to an Elasticsearch cluster. When the Elasticsearch index state is RED, Auditbeat occasionally makes the whole system unresponsive for one minute.
logs:
2018-08-03T12:29:45.951+0300 DEBUG [elasticsearch] elasticsearch/client.go:507 Bulk item insert failed (i=49, status=503): {"type":"unavailable_shards_exception","reason":"[auditd-2018.08.03][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[auditd-2018.08.03][1]] containing [23] requests]"}
# Glob pattern for configuration reloading
path: ${path.config}/conf.*.d/*.yml
# Period on which files under path should be checked for changes
reload.period: 10s
# Set to true to enable config reloading
reload.enabled: true

setup.template.settings:
  index.number_of_shards: 2

output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["xxx1", "xxx2"]
  index: "auditd-%{+yyyy.MM.dd}"

setup.template.name: "auditd"
setup.template.pattern: "auditd-*"
And some more logs:
2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:151 done
2018-08-03T12:41:59.099+0300 ERROR pipeline/output.go:92 Failed to publish events: temporary bulk send failure
2018-08-03T12:41:59.099+0300 DEBUG [elasticsearch] elasticsearch/client.go:666 ES Ping(url=http://xxx1:9200)
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:174 done
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:151 done
2018-08-03T12:41:59.100+0300 DEBUG [elasticsearch] elasticsearch/client.go:689 Ping status code: 200
2018-08-03T12:41:59.101+0300 INFO elasticsearch/client.go:690 Connected to Elasticsearch version 5.4.2
2018-08-03T12:41:59.101+0300 DEBUG [elasticsearch] elasticsearch/client.go:708 HEAD http://xxx1:9200/_template/auditd <nil>
2018-08-03T12:41:59.103+0300 INFO template/load.go:73 Template already exists and will not be overwritten.
2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:174 done
2018-08-03T12:42:00.815+0300 INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":930,"time":{"ms":4}},"total":{"ticks":6450,"time":{"ms":76},"value":6450},"user":{"ticks":5520,"time":{"ms":72}}},"info":{"ephemeral_id":"3e9667e7-a869-4ba6-9125-5840be3f01ff","uptime":{"ms":3360011}},"memstats":{"gc_next":82140992,"memory_alloc":41484232,"memory_total":395598440}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"batches":2,"failed":100,"total":100},"read":{"bytes":2361},"write":{"bytes":147983}},"pipeline":{"clients":2,"events":{"active":4117,"retry":200}}},"system":{"load":{"1":0.01,"15":2.08,"5":0.72,"norm":{"1":0.005,"15":1.04,"5":0.36}}}}}}
You are experiencing the backpressure problem: Auditbeat stops reading audit events from the kernel until it can resume delivering them to Elasticsearch. This causes the kernel's internal audit queue to fill, which in turn causes audited processes to block, making the system unresponsive.
There's no need to configure anything. By default, Auditbeat 6.3+ configures the kernel not to block audited processes when messages cannot be delivered, so the system should never become unresponsive, although audit messages can be lost.
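For reference, this behavior is controlled by the backpressure_strategy option of the auditd module. A minimal sketch of how it might look in auditbeat.yml (the surrounding module keys are assumed from a standard Auditbeat setup, not taken from the poster's config):

```yaml
auditbeat.modules:
  - module: auditd
    # How to mitigate backpressure when events cannot be delivered:
    #   auto      - pick the best strategy for the kernel (default)
    #   kernel    - tell the kernel not to block audited processes
    #   userspace - drop events in Auditbeat when its queue is full
    #   both      - combine kernel and userspace
    #   none      - no mitigation (audited processes may block)
    backpressure_strategy: auto
```

With auto (the default), a sufficiently recent kernel is configured not to block audited processes, trading lost audit messages for system responsiveness.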
uname -a
Linux icinga 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
I tried with backpressure_strategy set to auto and to kernel, but the problem still exists.
Dummy script for tests:
while true ; do { time cat /var/log/auditbeat/auditbeat >/dev/null; sleep 1 ;} 2>&1 | grep -v 0m0 ; done
Output: real 1m0.467s
EDITED:
I checked the source and saw that the backpressure setting was introduced in 124c8a28f9, which is after the 6.3.2 release. I will try to build the Debian package from the latest source code and see if it works as expected.
I built the package with: make snapshot
And tested it. It seems to work as expected. Thank you very much for your support.
Any plans for releasing 6.4.0 version?