Filebeat Memory Leakage Investigation

Hi,

Filebeat version: 5.2.0
OS: CentOS 6.6 and 7.2
Configuration: see the second post

In our production environment I'm investigating a memory leak in the filebeat process.

So far I've deduced (using go tool pprof) that the extra memory consumption is all located in the harvester.Harvest function, in big blobs (see the list output below).

  1. Might this be a configuration "error"?
  2. How should I proceed to investigate the root cause?

Kind regards,

Arne


pprof output:

$ go tool pprof <exe_path> http://<pprof_listener>/debug/pprof/heap
Fetching profile from http://<pprof_listener>/debug/pprof/heap
Saved profile in <profile_save_path>
Entering interactive mode (type "help" for commands)
(pprof) top
967.37MB of 972.89MB total (99.43%)
Dropped 250 nodes (cum <= 4.86MB)
Showing top 10 nodes out of 21 (cum >= 28.53MB)
      flat  flat%   sum%        cum   cum%
  917.22MB 94.28% 94.28%   917.22MB 94.28%  github.com/elastic/beats/filebeat/harvester.(*Harvester).Harvest
   23.80MB  2.45% 96.72%    28.03MB  2.88%  compress/flate.NewWriter
   17.10MB  1.76% 98.48%    17.10MB  1.76%  github.com/elastic/beats/filebeat/spooler.(*Spooler).flush
    8.75MB   0.9% 99.38%     8.75MB   0.9%  github.com/elastic/beats/vendor/github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
    0.50MB 0.052% 99.43%    28.53MB  2.93%  github.com/elastic/beats/vendor/github.com/Shopify/sarama.encode
         0     0% 99.43%    28.03MB  2.88%  compress/gzip.(*Writer).Write
         0     0% 99.43%   917.22MB 94.28%  github.com/elastic/beats/filebeat/prospector.(*Prospector).startHarvester.func1
         0     0% 99.43%    17.10MB  1.76%  github.com/elastic/beats/filebeat/spooler.(*Spooler).run
         0     0% 99.43%    28.53MB  2.93%  github.com/elastic/beats/vendor/github.com/Shopify/sarama.(*Broker).Produce
         0     0% 99.43%    28.53MB  2.93%  github.com/elastic/beats/vendor/github.com/Shopify/sarama.(*Broker).send
(pprof) list Harvest
Total: 972.89MB
ROUTINE ======================== github.com/elastic/beats/filebeat/harvester.(*Harvester).Harvest in /home/ademmers/gopath/src/github.com/elastic/beats/filebeat/harvester/log.go
  917.22MB   917.22MB (flat, cum) 94.28% of Total
         .          .     88:           case <-h.done:
         .          .     89:                   return
         .          .     90:           default:
         .          .     91:           }
         .          .     92:
         .          .     93:           message, err := r.Next()
         .          .     94:           if err != nil {
         .          .     95:                   switch err {
         .          .     96:                   case ErrFileTruncate:
         .          .     97:                           logp.Info("File was truncated. Begin reading file from offset 0: %s", h.state.Source)
         .          .     98:                           h.state.Offset = 0
         .          .     99:                           filesTruncated.Add(1)
         .          .    100:                   case ErrRemoved:
         .          .    101:                           logp.Info("File was removed: %s. Closing because close_removed is enabled.", h.state.Source)
         .          .    102:                   case ErrRenamed:
         .          .    103:                           logp.Info("File was renamed: %s. Closing because close_renamed is enabled.", h.state.Source)
         .          .    104:                   case ErrClosed:
         .          .    105:                           logp.Info("Reader was closed: %s. Closing.", h.state.Source)
         .          .    106:                   case io.EOF:
         .          .    107:                           logp.Info("End of file reached: %s. Closing because close_eof is enabled.", h.state.Source)
         .          .    108:                   case ErrInactive:
         .          .    109:                           logp.Info("File is inactive: %s. Closing because close_inactive of %v reached.", h.state.Source, h.config.CloseInactive)
         .          .    110:                   default:
         .          .    111:                           logp.Err("Read line error: %s; File: ", err, h.state.Source)
         .          .    112:                   }
         .          .    113:                   return
         .          .    114:           }
         .          .    115:
         .          .    116:           // Strip UTF-8 BOM if beginning of file
         .          .    117:           // As all BOMS are converted to UTF-8 it is enough to only remove this one
         .          .    118:           if h.state.Offset == 0 {
         .          .    119:                   message.Content = bytes.Trim(message.Content, "\xef\xbb\xbf")
         .          .    120:           }
         .          .    121:
         .          .    122:           // Update offset
         .          .    123:           h.state.Offset += int64(message.Bytes)
         .          .    124:
         .          .    125:           // Create state event
  546.12MB   546.12MB    126:           event := input.NewEvent(h.getState())
         .          .    127:
  371.11MB   371.11MB    128:           text := string(message.Content)
         .          .    129:
         .          .    130:           // Check if data should be added to event. Only export non empty events.
         .          .    131:           if !message.IsEmpty() && h.shouldExportLine(text) {
         .          .    132:                   event.ReadTime = message.Ts
         .          .    133:                   event.Bytes = message.Bytes
ROUTINE ======================== github.com/elastic/beats/filebeat/prospector.(*Prospector).startHarvester.func1 in /home/ademmers/gopath/src/github.com/elastic/beats/filebeat/prospector/prospector.go
         0   917.22MB (flat, cum) 94.28% of Total
         .          .    243:                   atomic.AddUint64(&p.harvesterCounter, ^uint64(0))
         .          .    244:                   p.wg.Done()
         .          .    245:           }()
         .          .    246:
         .          .    247:           // Starts harvester and picks the right type. In case type is not set, set it to defeault (log)
         .   917.22MB    248:           h.Harvest(reader)
         .          .    249:   }()
         .          .    250:
         .          .    251:   return nil
         .          .    252:}
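
For reference, here is a minimal sketch of how a Go process exposes the /debug/pprof/heap endpoint fetched above. Filebeat sets this up when started with an HTTP profiling listener; the standalone program below is only illustrative, not filebeat code.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Anything served on this address now answers /debug/pprof/heap,
	// which is the URL that `go tool pprof` fetches in the session above.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}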

Configuration:

  • /etc/filebeat/filebeat.yml
filebeat.config_dir: /etc/filebeat/conf.d
filebeat.publish_async: true
filebeat.shutdown_timeout: 10s
filebeat.spool_size: 65536
logging.files:
    keepfiles: 7
    rotateeverybytes: 10485760
logging.level: info
logging.to_files: true
logging.to_syslog: false
output.kafka:
    bulk_max_size: 2048
    enabled: true
    hosts:
    - <list_of_3_kafka_nodes>
    keep_alive: 180s
    timeout: 30s
    topic: <kafka_topic>
    version: '0.10'
    worker: 8
  • /etc/filebeat/conf.d/service_1.yml
filebeat.prospectors:
  - paths: [ <java_based_app_log_running_on_tomcat7_using_log4j2> ]
    document_type: tomcatlog
    input_type: log
    fields:
      type: service
      subtype: <service_1_reference>
      application: <service_1_name>
      env: <runtime_environment>
    harvester_limit: 10
    ignore_older: 3h
    close_inactive: 2h
    clean_inactive: 4h
    close_removed: true
    clean_removed: true
    close_renamed: true
    close_timeout: 2h
    multiline:
      pattern: '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'
      negate: true
      match: after

  - paths: [ <tomcat_access_log> ]
    document_type: accesslog
    input_type: log
    fields:
      type: access-log
      subtype: <service_1_reference>
      application: <service_1_name>
      env: <runtime_environment>
    harvester_limit: 10
    ignore_older: 3h
    close_inactive: 2h
    clean_inactive: 4h
    close_removed: true
    clean_removed: true
    close_renamed: true
    close_timeout: 2h

The problem is with filebeat.publish_async: true.

OK, thanks for the quick reply. I will reconfigure several nodes and observe their behaviour over the coming days.

Still, I can understand perfectly why async publishing could lead to higher memory usage while busy or while waiting for ACKs, but shouldn't the Go GC clean out that extra memory once the ACKs are received?
In other words: doesn't this sound like a memory leak? (Or is this a known leak that I missed when searching older issues?)

Well, it's a known memory leak created by unfortunate interactions between multiple outputs, load balancing, buffering for async publishing, the ordered-ACK requirement in filebeat, and potentially error handling in the outputs. Batches must be ACKed in order, but are sent out of order... this can lead to the output pipeline turning into a very large buffer that is never cleaned up, or only cleaned up eventually (but not before an OOM)...
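
To make that concrete, here is a toy sketch (not filebeat's actual pipeline code; the batch type and sizes are invented) of how out-of-order completion combined with an in-order release requirement pins memory: everything behind the oldest unfinished batch stays referenced, so the GC cannot reclaim it even though later batches are long done.

package main

import "fmt"

// batch stands in for a spooler batch whose event data must stay in memory
// until the batch is released (ACKed) in order.
type batch struct {
	id   int
	data []byte // event payloads held until release
	done bool   // has the output finished with this batch?
}

func main() {
	var pending []*batch // ordered publish window

	// Five batches are sent out; every one except the first completes quickly.
	for i := 0; i < 5; i++ {
		pending = append(pending, &batch{id: i, data: make([]byte, 1<<20)})
	}
	for _, b := range pending[1:] {
		b.done = true // out-of-order completion
	}

	// In-order release: stop at the first batch that is not done yet.
	released := 0
	for _, b := range pending {
		if !b.done {
			break
		}
		released++
	}
	pending = pending[released:]

	// Prints "released 0 batches, 5 still pinned in memory": one slow batch
	// at the head keeps all later (finished) batches, and their data, alive.
	fmt.Printf("released %d batches, %d still pinned in memory\n",
		released, len(pending))
}

With publish_async and load balancing, this pending window can keep growing for as long as one slow or failing batch blocks the head, which would explain why the allocations made in Harvest dominate the heap profile above.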

OK, got it. I'll live with non-async publishing for now.

Thanks!
