Filebeat loses data when Elasticsearch is down


(Tracker) #1

I have experienced a data-loss issue with filebeat ----> elasticsearch.
When one pod of the Elasticsearch cluster was down, I found that Filebeat had lost log data.

The related configuration is as follows:
Elasticsearch version: 6.1.2
Filebeat version: 6.1.1
filebeat.yml configuration:

filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - "/var/log/containers/${_SYSTEM_POD_NAME}_${_SYSTEM_NAMESPACE}_${_SYSTEM_APP_NAME}-*.log"
  symlinks: true
  scan_frequency: 1s
  fields:
    source: ${_SYSTEM_APP_NAME:default}
    host: ${HOSTNAME:default}
    tags:
      - ${_SYSTEM_NAMESPACE:default}
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: log
  multiline.pattern: ${_SYSTEM_MULTILINE_PATTERN}
  multiline.negate: true
  multiline.match: after
  fields_under_root: true
  pipeline: ${_SYSTEM_PIPELINE}

output.elasticsearch:
  hosts: ["elasticsearch-logging.dopware:9200"]
  worker: 4
  enabled: true
  timeout: 300
  bulk_max_size: 100

and the Filebeat log is as below:

2018/06/07 02:51:40.758011 harvester.go:215: INFO Harvester started for file: /var/log/containers/p1-5c88559c4f-tcsjr_bill-pre_p1-0bb9d1d0bfb2d014c477bb5960c7667e29648d09673f6b5dea3528a229178a45.log
2018/06/07 02:51:41.767124 harvester.go:215: INFO Harvester started for file: /var/log/containers/p1-5c88559c4f-tcsjr_bill-pre_p1-0bb9d1d0bfb2d014c477bb5960c7667e29648d09673f6b5dea3528a229178a45.log
2018/06/07 02:51:42.797308 harvester.go:215: INFO Harvester started for file: /var/log/containers/p1-5c88559c4f-tcsjr_bill-pre_p1-0bb9d1d0bfb2d014c477bb5960c7667e29648d09673f6b5dea3528a229178a45.log
2018/06/07 02:51:44.388487 client.go:276: ERR Failed to perform any bulk index operations: Post http://elasticsearch-logging.dopware:9200/_bulk: EOF
2018/06/07 02:51:44.390459 client.go:276: ERR Failed to perform any bulk index operations: Post http://elasticsearch-logging.dopware:9200/_bulk: EOF
2018/06/07 02:51:45.388701 output.go:92: ERR Failed to publish events: Post http://elasticsearch-logging.dopware:9200/_bulk: EOF
2018/06/07 02:51:45.390669 output.go:92: ERR Failed to publish events: Post http://elasticsearch-logging.dopware:9200/_bulk: EOF
2018/06/07 02:51:47.632400 output.go:74: ERR Failed to connect: 503 Service Unavailable: {
  "name" : "elasticsearch-logging-v1-mtgnl",
  "cluster_name" : "kubernetes-logging",
  "cluster_uuid" : "xI_7pWsuTaCuVuWjjTyurg",
  "version" : {
    "number" : "6.1.2",
    "build_hash" : "5b1fea5",
    "build_date" : "2018-01-10T02:35:59.208Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
2018/06/07 02:51:47.635750 output.go:74: ERR Failed to connect: 503 Service Unavailable: {
  "name" : "elasticsearch-logging-v1-mtgnl",
  "cluster_name" : "kubernetes-logging",
  "cluster_uuid" : "xI_7pWsuTaCuVuWjjTyurg",
  "version" : {
    "number" : "6.1.2",
    "build_hash" : "5b1fea5",
    "build_date" : "2018-01-10T02:35:59.208Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
2018/06/07 02:51:51.478984 metrics.go:39: INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30000 beat.memstats.gc_next=34604992 beat.memstats.memory_alloc=18190248 beat.memstats.memory_total=14279169192 filebeat.events.active=2 filebeat.events.added=17690 filebeat.events.done=17688 filebeat.harvester.open_files=39 filebeat.harvester.running=38 filebeat.harvester.started=3 libbeat.config.module.running=0 libbeat.output.read.bytes=217078 libbeat.output.read.errors=2 libbeat.output.write.bytes=13305478 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=4117 libbeat.pipeline.events.filtered=3 libbeat.pipeline.events.published=17684 libbeat.pipeline.events.retry=500 libbeat.pipeline.events.total=17687 libbeat.pipeline.queue.acked=17684 registrar.states.current=38 registrar.states.update=17688 registrar.writes=88
2018/06/07 02:52:15.649178 output.go:74: ERR Failed to connect: 503 Service Unavailable: {
  "name" : "elasticsearch-logging-v1-mtgnl",
  "cluster_name" : "kubernetes-logging",
  "cluster_uuid" : "xI_7pWsuTaCuVuWjjTyurg",
  "version" : {
    "number" : "6.1.2",
    "build_hash" : "5b1fea5",
    "build_date" : "2018-01-10T02:35:59.208Z",
    "build_snapshot" : false,
    "lucene_version" : "7.1.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
2018/06/07 02:52:15.650109 client.go:651: INFO Connected to Elasticsearch version 6.1.2
2018/06/07 02:52:15.656559 client.go:651: INFO Connected to Elasticsearch version 6.1.2
2018/06/07 02:52:15.661142 load.go:73: INFO Template already exists and will not be overwritten.
2018/06/07 02:52:15.679937 load.go:73: INFO Template already exists and will not be overwritten.
2018/06/07 02:52:16.110273 harvester.go:215: INFO Harvester started for file: /var/log/containers/p1-5c88559c4f-tcsjr_bill-pre_p1-0bb9d1d0bfb2d014c477bb5960c7667e29648d09673f6b5dea3528a229178a45.log

What is the retry policy of Filebeat? I know about the at-least-once delivery semantics, but it does not seem to have taken effect here.

Thank you very much!


(Adrian Serrano) #2

Filebeat by default will keep retrying sending the events until it receives an ACK, so a network failure can cause an event to be delivered more than once, but never to be lost. That's what at-least-once means in this case.

But what can be happening here is that some file is rotated or removed while Filebeat is not processing any new files, because its publish pipeline is full due to the output being unavailable. Does that sound like something that could have happened in your case?
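If that is the case, one possible mitigation (a sketch only; whether it helps depends on how fast your files rotate) is to keep harvesters attached to rotated files for longer, so files that are already open can still be drained once the output recovers:

```yaml
filebeat.prospectors:
- type: log
  paths:
    - "/var/log/containers/*.log"
  # Keep reading a file even after it has been deleted/rotated away;
  # on Linux the open file descriptor preserves access to the data.
  close_removed: false
  clean_removed: false
  # Safety valve so harvesters on removed files don't stay open forever.
  close_timeout: 30m
```

Note that this only protects files that already have a harvester; a file that is created and rotated away entirely while the pipeline is blocked can still be missed.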

BTW, have a look at this upcoming feature; it will make Beats much more resilient to situations like this one.


(Tracker) #3

@adrisr thank you very much for your reply!
In my case, Filebeat harvests logs from the log files written by Docker's json-file logging driver. The Docker configuration for json-file is as follows:

--log-opt max-size=30m --log-opt max-file=3

and the TPS is about 18000 when accessing the app, which produces a lot of logs.

But I don't understand: even if the publish pipeline is full, shouldn't a new harvester still be started for each new file? How can this situation arise?


(Adrian Serrano) #4

The in-memory queue is bounded, so at some point Filebeat has to stop processing new log sources; otherwise the queue would grow too much.

To avoid this situation you can increase the number of events that the queue can store, see https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html
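For example, a larger memory queue might look like this in filebeat.yml (the numbers are illustrative, not a recommendation; size it to your event rate and available memory):

```yaml
queue.mem:
  events: 65536          # default in 6.x is 4096
  flush.min_events: 2048 # publish in batches of at least this many events
  flush.timeout: 1s      # ...or whatever has accumulated after 1s
```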

You can also upgrade to 6.3 once it's released to get a disk-backed queue that is not bounded by memory size.
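As a preview, the disk-backed (spool) queue is expected to be configured along these lines; it is a beta feature, so the option names and defaults may still change before release:

```yaml
queue.spool:
  file:
    path: "${path.data}/spool.dat"  # on-disk spool file
    size: 512MiB                    # maximum size of the spool
    page_size: 16KiB
  write:
    buffer_size: 10MiB
    flush.timeout: 1s
    flush.events: 1024
```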


(Tracker) #5

Thank you again for your explanation!
I think I can try increasing queue.mem.events in filebeat.yml.
But in my case, how can I confirm that this situation was caused by the bounded in-memory queue? If I enable the debug option, it produces a lot of logs on my console, and then it's not easy to find the key information to confirm the cause.
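For what it's worth, the periodic "Non-zero metrics" INFO lines (like the one in the log excerpt above) already expose the queue state without debug logging: libbeat.pipeline.events.active is the number of events currently held in the queue, so watching it sit at the queue's capacity during an outage would point at a full queue. Alternatively, debug output can be scoped to a few subsystems instead of everything, e.g. (assuming the 6.x logging options):

```yaml
logging.level: debug
# Only emit debug messages from the publisher and harvester components,
# instead of the full firehose.
logging.selectors: ["publish", "harvester"]
```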


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.