Frequent data loss

We have more than 135 Filebeat instances running on different hosts, each sending data to more than 10 pipelines.
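For context, each filebeat.yml output section looks roughly like this (a sketch only; the hostnames, pipeline names, and conditions are placeholders, assuming events are routed to Elasticsearch ingest pipelines per log path):

```yaml
# filebeat.yml (sketch; hosts, pipeline names, and paths are placeholders)
output.elasticsearch:
  hosts: ["es-node1:9200", "es-node2:9200"]
  pipelines:
    - pipeline: "pipeline-app1"
      when.contains:
        source: "/var/log/app1"      # route app1 logs to its own ingest pipeline
    - pipeline: "pipeline-app2"
      when.contains:
        source: "/var/log/app2"
```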

We are getting this error on most of the Filebeat instances (Failed to publish events: temporary bulk send failure):

2018-09-04T17:39:22.326+0530 INFO elasticsearch/client.go:690 Connected to Elasticsearch version 6.3.1
2018-09-04T17:39:22.331+0530 INFO template/load.go:73 Template already exists and will not be overwritten.
2018-09-04T17:39:22.331+0530 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-09-04T17:39:22.331+0530 INFO [publish] pipeline/retry.go:174 done
2018-09-04T17:39:22.341+0530 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-09-04T17:39:22.341+0530 INFO [publish] pipeline/retry.go:151 done
2018-09-04T17:39:22.973+0530 INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":150580,"time":{"ms":3}},"total":{"ticks":1106600,"time":{"ms":75},"value":1106600},"user":{"ticks":956020,"time":{"ms":72}}},"info":{"ephemeral_id":"c0c77725-d7ec-4d04-9778-6c3e87caf483","uptime":{"ms":271440046}},"memstats":{"gc_next":20433952,"memory_alloc":18759592,"memory_total":81088606696}},"filebeat":{"harvester":{"open_files":10,"running":10}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":29,"failed":89,"total":89},"read":{"bytes":30628},"write":{"bytes":53426}},"pipeline":{"clients":96,"events":{"active":4126,"retry":178}}},"registrar":{"states":{"current":11}},"system":{"load":{"1":0.73,"15":0.21,"5":0.27,"norm":{"1":0.0913,"15":0.0263,"5":0.0338}}},"xpack":{"monitoring":{"pipeline":{"events":{"published":3,"total":3},"queue":{"acked":3}}}}}}}
2018-09-04T17:39:23.341+0530 ERROR pipeline/output.go:92 Failed to publish events: temporary bulk send failure
2018-09-04T17:39:23.341+0530 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-09-04T17:39:23.341+0530 INFO [publish] pipeline/retry.go:174 done
2018-09-04T17:39:23.341+0530 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-09-04T17:39:23.341+0530 INFO [publish] pipeline/retry.go:151 done
2018-09-04T17:39:23.342+0530 INFO elasticsearch/client.go:690 Connected to Elasticsearch version 6.3.1
2018-09-04T17:39:23.344+0530 INFO template/load.go:73 Template already exists and will not be overwritten.
2018-09-04T17:39:23.344+0530 INFO [publish] pipeline/retry.go:172 retryer: send unwait-signal to consumer
2018-09-04T17:39:23.344+0530 INFO [publish] pipeline/retry.go:174 done
2018-09-04T17:39:23.346+0530 INFO [publish] pipeline/retry.go:149 retryer: send wait signal to consumer
2018-09-04T17:39:23.346+0530 INFO [publish] pipeline/retry.go:151 done

It seems like Elasticsearch cannot handle the load. What do you see in the Elasticsearch logs?

To minimize data loss, you could use spooling to disk, which has been available since 6.3: https://www.elastic.co/guide/en/beats/filebeat/6.3/configuring-internal-queue.html#configuration-internal-queue-spool
It stores events in a file on disk until they can be forwarded to the output.
You could also increase the size of the memory queue, but that obviously uses more RAM, so spooling to disk may be the better option. A sketch of both options follows.
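Here is a minimal filebeat.yml sketch of the two alternatives. The spool settings follow the 6.3 documentation linked above; the numbers are illustrative, not tuned recommendations. Only one queue type can be configured at a time, so the memory-queue variant is commented out.

```yaml
# Option 1: spool events to a file on disk until the output accepts them (beta in 6.3)
queue.spool:
  file:
    path: "${path.data}/spool.dat"   # location of the spool file
    size: 512MiB                     # maximum size of the spool file
    page_size: 16KiB
  write:
    buffer_size: 10MiB
    flush.timeout: 5s
    flush.events: 1024

# Option 2: enlarge the in-memory queue instead (uses more RAM, lost on restart)
#queue.mem:
#  events: 8192
#  flush.min_events: 512
#  flush.timeout: 5s
```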

Will spooling to disk slow down how quickly logs appear in Kibana? Also, it is a beta feature, so is it not recommended for use in production?

I am also getting GC overhead warnings on each instance of the Elasticsearch cluster.

Instance memory: 60 GB
ES heap size: 32 GB
Queue size: 7000
Thread pool: write
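In case it helps, this is roughly how those settings are applied on each node (a sketch; the heap itself is set via -Xms32g/-Xmx32g in jvm.options rather than in YAML):

```yaml
# elasticsearch.yml (sketch of the settings listed above)
# Heap (-Xms32g -Xmx32g) lives in jvm.options, not here.
thread_pool.write.queue_size: 7000   # enlarged write/bulk queue; queued bulk requests are held in heap
```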

Please help
