System unresponsive with auditbeat running

I've set up Auditbeat to report to an Elasticsearch cluster. If the Elasticsearch index state is RED, Auditbeat occasionally makes the whole system unresponsive for about one minute.

logs:

2018-08-03T12:29:45.951+0300 DEBUG [elasticsearch] elasticsearch/client.go:507 Bulk item insert failed (i=49, status=503): {"type":"unavailable_shards_exception","reason":"[auditd-2018.08.03][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[auditd-2018.08.03][1]] containing [23] requests]"}
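
The RED state can be confirmed per index with the standard cluster health API, for example (xxx1 is the same placeholder host as in the config below):

# Shows overall cluster status plus per-index RED/YELLOW/GREEN state
curl -s 'http://xxx1:9200/_cluster/health?level=indices&pretty'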

Could you share your configuration, formatted using </>? Also, if you could add more debug logs, that would be useful.

Config file:


auditbeat.config.modules:
  # Glob pattern for configuration reloading
  path: ${path.config}/conf.*.d/*.yml

  # Period on which files under path should be checked for changes
  reload.period: 10s

  # Set to true to enable config reloading
  reload.enabled: true

setup.template.settings:
  index.number_of_shards: 2

output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["xxx1", "xxx2"]
  index: "auditd-%{+yyyy.MM.dd}"

setup.template.name: "auditd"
setup.template.pattern: "auditd-*"

And some more logs:

2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-08-03T12:41:58.099+0300 INFO [publish] pipeline/retry.go:151 done
2018-08-03T12:41:59.099+0300 ERROR pipeline/output.go:92 Failed to publish events: temporary bulk send failure
2018-08-03T12:41:59.099+0300 DEBUG [elasticsearch] elasticsearch/client.go:666 ES Ping(url=http://xxx1:9200)
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:174 done
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-08-03T12:41:59.099+0300 INFO [publish] pipeline/retry.go:151 done
2018-08-03T12:41:59.100+0300 DEBUG [elasticsearch] elasticsearch/client.go:689 Ping status code: 200
2018-08-03T12:41:59.101+0300 INFO elasticsearch/client.go:690 Connected to Elasticsearch version 5.4.2
2018-08-03T12:41:59.101+0300 DEBUG [elasticsearch] elasticsearch/client.go:708 HEAD http://xxx1:9200/_template/auditd <nil>
2018-08-03T12:41:59.103+0300 INFO template/load.go:73 Template already exists and will not be overwritten.
2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-08-03T12:41:59.103+0300 INFO [publish] pipeline/retry.go:174 done
2018-08-03T12:42:00.815+0300 INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":930,"time":{"ms":4}},"total":{"ticks":6450,"time":{"ms":76},"value":6450},"user":{"ticks":5520,"time":{"ms":72}}},"info":{"ephemeral_id":"3e9667e7-a869-4ba6-9125-5840be3f01ff","uptime":{"ms":3360011}},"memstats":{"gc_next":82140992,"memory_alloc":41484232,"memory_total":395598440}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"batches":2,"failed":100,"total":100},"read":{"bytes":2361},"write":{"bytes":147983}},"pipeline":{"clients":2,"events":{"active":4117,"retry":200}}},"system":{"load":{"1":0.01,"15":2.08,"5":0.72,"norm":{"1":0.005,"15":1.04,"5":0.36}}}}}}

You are experiencing the backpressure problem: Auditbeat stops reading audit events from the kernel until it can resume delivering them to Elasticsearch. This causes the kernel's internal audit queue to fill, and audited processes then block until their records can be queued, which makes the system unresponsive. The one-minute stalls you see match the kernel's default backlog_wait_time of 60 seconds.
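
One way to confirm this from the kernel side (this uses the stock auditctl tool, nothing Auditbeat-specific) is to inspect the audit status counters while the system is stalled:

# backlog = events queued in the kernel waiting to be read; when it
# reaches backlog_limit, processes generating audit records start to
# block. lost counts discarded records. (Exact fields vary with the
# kernel and audit userspace versions.)
sudo auditctl -s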

Starting with Auditbeat 6.3.0, new strategies for dealing with this problem are implemented; see the backpressure_strategy option:
https://www.elastic.co/guide/en/beats/auditbeat/master/auditbeat-module-auditd.html#_configuration_options_13

There's no need to configure anything. By default, Auditbeat 6.3+ will configure the kernel to not block audited processes when messages cannot be delivered, so the system will never become unresponsive, but audit messages can be lost.
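
If you do want to set it explicitly, the option belongs to the auditd module configuration. A minimal sketch (module layout as in the stock auditbeat.yml; the documented values are auto, kernel, userspace, both, and none):

auditbeat.modules:
- module: auditd
  # auto (the default) prefers the kernel strategy when the kernel
  # supports it (Linux 3.14+), falling back to userspace otherwise.
  backpressure_strategy: auto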

Thanks for your help. Here is what I have installed:

apt-cache policy auditbeat 
auditbeat:
  Installed: 6.3.2
  Candidate: 6.3.2

And the kernel version is:

uname -a
Linux icinga 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

I tried with backpressure_strategy set to both auto and kernel, but the problem still exists.

Dummy script for tests (it repeatedly times reading a file, which generates audited syscalls, and greps away the fast sub-second runs):

while true ; do { time cat /var/log/auditbeat/auditbeat >/dev/null; sleep 1 ;} 2>&1 | grep -v 0m0 ; done

Output (the cat call blocks for a full minute):
real 1m0.467s

EDITED:

I checked the source and saw that the backpressure setting was introduced in 124c8a28f9, which is after the 6.3.2 release. I will try to build the Debian package from the latest source code and see if it works as expected.

You're completely right, I got confused with the dates. My bad.

The feature will be available in the next release, 6.4.0.

It would be great if you could test the feature now. Please update this thread with the results, or ask if you run into any problems building the packages.

I built the package with:
make snapshot
and tested it. It seems to work as expected. Thank you very much for your support.
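
In case it helps anyone else, these are roughly the steps (assuming the GOPATH-based layout the beats repository used at the time; exact targets may differ between versions, and the packaging step may need Docker):

# Fetch the beats source into the Go workspace without building it
go get -d github.com/elastic/beats
cd "$GOPATH/src/github.com/elastic/beats/auditbeat"

# Build snapshot packages (including the .deb) for the current checkout
make snapshot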
Are there any plans for releasing version 6.4.0?

It will likely be released later this month.