Filebeat is not indexing all the logs


(Carolina Baldino) #1

Hello,

I have Elastic Stack (ElasticSearch + Kibana + Logstash + Filebeat) up and running in production, and every module is running in docker, with the officials images provided by elastic.

Filebeat instance is connected and ingesting data about docker logs (from another 'dockerized' app that serves a web plataform) to Elasticsearch. I have been monitoring and adjusting configurations for a couple of weeks and everything seems fine so far..

The thing is that the web platform prints several log lines per minute, and the problem is that I don't know why but filebeat (or Elasticsearch) is not indexing all the logs, mostly the old ones.
So I have one index per day, and as you can see, if I leave a few days everything just like this, docs numbers doesn't increase much, or even worse, doesn't increase at all (for example index 25.01.2019 or index 21.01.2019 that is not even indexed):

I don't get to trust if the ingestion is working fine.. I don't have any errors logs in the elastic modules, only some isolated overhead ones in elasticsearch.

[2019-02-06T10:08:41,292][INFO ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][394787] overhead, spent [403ms] collecting in the last [1s]
[2019-02-06T13:06:30,535][INFO ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][405443] overhead, spent [430ms] collecting in the last [1s]
[2019-02-06T13:39:32,686][INFO ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][407422] overhead, spent [328ms] collecting in the last [1s]

(The specs of the server I am running the stack are: 2 GB RAM - 1 CPU - 50 GB Disk)

I have tried thing such as cleaning indices cache, or deleting the indices and let them re-create again.. but nothing worked so far.

I leave here my filebeat.yml conf, which is the one that I consider relevant for this problem. I was also wondering if there are any best practices with the configs to try to get all the logs (such as the time set in close_inactive and scan_frequency)

filebeat.inputs:
- type: docker
  close_inactive: 5m
  scan_frequency: 10s
  combine_partial: true
  multiline.pattern: '^.,'
  multiline.negate: true
  multiline.match: after
  containers:
    ids: 
      - '*'
    json.overwrite_keys: true

setup.template.enabled: true
setup.template.name: "filebeat"
setup.template.pattern: "filebeat-*"
    
setup.kibana:
  host: <host>

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

output.logstash:
  hosts: <host>

xpack.monitoring:
  enabled: true
  elasticsearch:
    hosts: <host>

Monitoring screen:

Here is a sample of logs timestamps in less that a minute (I deleted log lines for privacy reasons) so you can have an aprox amount of logs per minute:

I, [2019-01-26T20:39:03.328611 #54]   
I, [2019-01-26T20:39:04.167190 #51]   
I, [2019-01-26T20:39:04.911602 #51]   
I, [2019-01-26T20:39:04.917009 #51]   
I, [2019-01-26T20:39:04.917127 #51]   
D, [2019-01-26T20:39:04.922000 #51]  
D, [2019-01-26T20:39:04.926078 #51]  
D, [2019-01-26T20:39:04.927655 #51]  
D, [2019-01-26T20:39:04.934731 #51]    
D, [2019-01-26T20:39:04.938701 #51]  
D, [2019-01-26T20:39:04.946900 #51] 
D, [2019-01-26T20:39:04.948744 #51] 
D, [2019-01-26T20:39:04.952247 #51]  
D, [2019-01-26T20:39:04.966808 #51]  
D, [2019-01-26T20:39:04.980669 #51]  
D, [2019-01-26T20:39:04.988168 #51]    
D, [2019-01-26T20:39:04.996505 #51]  
I, [2019-01-26T20:39:05.599940 #54]    
I, [2019-01-26T20:39:05.723844 #51]   
I, [2019-01-26T20:39:13.443756 #51]    
I, [2019-01-26T20:39:14.172979 #54]   
I, [2019-01-26T20:39:15.605774 #51]   
I, [2019-01-26T20:39:23.560680 #54]   
I, [2019-01-26T20:39:24.179016 #51]   
I, [2019-01-26T20:39:25.612253 #54]   
I, [2019-01-26T20:39:28.791610 #51]   
I, [2019-01-26T20:39:28.795650 #51]   
D, [2019-01-26T20:39:28.800424 #51]  
D, [2019-01-26T20:39:28.803388 #51]  
D, [2019-01-26T20:39:28.806776 #51]  
D, [2019-01-26T20:39:28.810037 #51]  
D, [2019-01-26T20:39:28.812938 #51]  
D, [2019-01-26T20:39:28.816084 #51]  
D, [2019-01-26T20:39:28.819838 #51]  
D, [2019-01-26T20:39:28.823770 #51]  
D, [2019-01-26T20:39:28.826639 #51]  
D, [2019-01-26T20:39:28.830148 #51]  
D, [2019-01-26T20:39:28.833498 #51]  
D, [2019-01-26T20:39:28.838195 #51]   
D, [2019-01-26T20:39:28.840348 #51]  
D, [2019-01-26T20:39:28.842862 #51]   
I, [2019-01-26T20:39:28.844530 #51]   
I, [2019-01-26T20:39:28.845306 #51]   
D, [2019-01-26T20:39:28.846696 #51]  
D, [2019-01-26T20:39:28.849102 #51]   
D, [2019-01-26T20:39:28.852300 #51]  
D, [2019-01-26T20:39:28.855567 #51] 
D, [2019-01-26T20:39:28.859742 #51]  
D, [2019-01-26T20:39:28.861523 #51]  
D, [2019-01-26T20:39:28.863520 #51]  
D, [2019-01-26T20:39:28.865957 #51]  
D, [2019-01-26T20:39:28.867419 #51]  
D, [2019-01-26T20:39:28.869776 #51]  
D, [2019-01-26T20:39:28.871475 #51]  
D, [2019-01-26T20:39:28.873308 #51]  
D, [2019-01-26T20:39:28.875566 #51]  
D, [2019-01-26T20:39:28.877220 #51]  
I, [2019-01-26T20:39:28.879085 #51]   
D, [2019-01-26T20:39:28.883421 #51]  
D, [2019-01-26T20:39:28.888947 #51]  
D, [2019-01-26T20:39:28.892591 #51]   
D, [2019-01-26T20:39:28.938300 #51]  
D, [2019-01-26T20:39:28.941294 #51]  
D, [2019-01-26T20:39:28.943258 #51]  
D, [2019-01-26T20:39:28.945629 #51]   
D, [2019-01-26T20:39:28.947309 #51]  
D, [2019-01-26T20:39:28.949648 #51]   
D, [2019-01-26T20:39:28.951606 #51]  
D, [2019-01-26T20:39:28.953524 #51]  
D, [2019-01-26T20:39:28.956036 #51]  
D, [2019-01-26T20:39:28.957950 #51]  
D, [2019-01-26T20:39:28.959591 #51]  
D, [2019-01-26T20:39:28.961913 #51] 
D, [2019-01-26T20:39:28.963721 #51]
D, [2019-01-26T20:39:28.965886 #51]  
D, [2019-01-26T20:39:28.967896 #51]  
D, [2019-01-26T20:39:28.969675 #51] 
D, [2019-01-26T20:39:28.972071 #51]  
D, [2019-01-26T20:39:28.973690 #51] 
D, [2019-01-26T20:39:28.975851 #51] 
D, [2019-01-26T20:39:28.977380 #51] 
D, [2019-01-26T20:39:28.979141 #51] 
D, [2019-01-26T20:39:28.981362 #51] 
D, [2019-01-26T20:39:28.982992 #51] 
D, [2019-01-26T20:39:28.984326 #51] 
D, [2019-01-26T20:39:28.986509 #51]  
D, [2019-01-26T20:39:28.988108 #51]

So please, could you help me to find out what is the problem or miss configuration that it's making me lose the logs?

Thanks


(Pier-Hugues Pellerin) #2

This is unusual, did you look at the filebeat logs and the Logstash log to see if there is some errors, maybe connectivity or parsing errors?


(Carolina Baldino) #3

Hello Pier,

Filebeat logs shows nothing but this, which just monitoring info:

2019-02-07T14:19:56.368Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":73820,"time":{"ms":15}},"total":{"ticks":495420,"time":{"ms":208},"value":495420},"user":{"ticks":421600,"time":{"ms":193}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":11},"info":{"ephemeral_id":"3e5264d3-a191-41ea-ab02-1fa56d56c2d6","uptime":{"ms":80310101}},"memstats":{"gc_next":9215456,"memory_alloc":6702504,"memory_total":79485974376,"rss":-516096}},"filebeat":{"events":{"active":-141,"added":937,"done":1078},"harvester":{"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":1078,"batches":14,"total":1078},"read":{"bytes":84},"write":{"bytes":102271}},"pipeline":{"clients":1,"events":{"active":1,"published":937,"total":937},"queue":{"acked":1078}}},"registrar":{"states":{"current":5,"update":1078},"writes":{"success":14,"total":14}},"system":{"load":{"1":0.19,"15":0.26,"5":0.27,"norm":{"1":0.19,"15":0.26,"5":0.27}}}}}}

From elasticsearch, today's logs, which seems totally fine, creating today's indices:

[2019-02-07T00:00:06,585][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:06,687][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:06,733][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:07,591][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:07,634][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:08,562][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T00:00:38,639][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.02.07/nG82CejNSCC43PhxYqtWoQ] update_mapping [doc]
[2019-02-07T02:03:00,003][INFO ][o.e.x.m.MlDailyMaintenanceService] [QyJqSRr] triggering scheduled [ML] maintenance tasks
[2019-02-07T02:03:00,008][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [QyJqSRr] Deleting expired data
[2019-02-07T02:03:00,067][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [QyJqSRr] Completed deletion of expired data
[2019-02-07T02:03:00,075][INFO ][o.e.x.m.MlDailyMaintenanceService] [QyJqSRr] Successfully completed [ML] maintenance tasks
[2019-02-07T07:26:12,760][INFO ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][471320] overhead, spent [405ms] collecting in the last [1.1s]
[2019-02-07T13:38:31,554][WARN ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][493631] overhead, spent [562ms] collecting in the last [1s]

From logstash, seems fine too, last log it's about API status:

[2019-02-06T13:55:53,371][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["http://<host>:<port>"]}
[2019-02-06T13:55:53,517][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>".monitoring-logstash", :thread=>"#<Thread:0x17098a01 run>"}
[2019-02-06T13:55:53,552][INFO ][logstash.agent           ] Pipelines running {:count=>2, :running_pipelines=>[:main, :".monitoring-logstash"], :non_running_pipelines=>[]}
[2019-02-06T13:55:54,719][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=><port>}

So I tried deleting index from 2019.01.25. I selected that day because as you can see in my previous comment, is the index which has only 4 documents. I deleted index cache too before deleting the index, and I also stopped filebeat before I deleted the index. When I started filebeat again, it successfully created the index, but still only load 4 documents:

[2019-02-07T14:36:15,892][INFO ][o.e.c.m.MetaDataDeleteIndexService] [QyJqSRr] [filebeat-6.5.4-2019.01.25/BDKlBaIzSeqP_dV3o2MzMA] deleting index
[2019-02-07T14:36:55,215][INFO ][o.e.m.j.JvmGcMonitorService] [QyJqSRr] [gc][497130] overhead, spent [260ms] collecting in the last [1s]
[2019-02-07T14:37:22,309][INFO ][o.e.c.m.MetaDataCreateIndexService] [QyJqSRr] [filebeat-6.5.4-2019.01.25] creating index, cause [auto(bulk api)], templates [filebeat], shards [5]/[1], mappings [doc]
[2019-02-07T14:37:23,314][INFO ][o.e.c.m.MetaDataMappingService] [QyJqSRr] [filebeat-6.5.4-2019.01.25/3UnIBWG2RyuPu2AW5EXeGg] update_mapping [doc]

And the problem only seems to exist with old indices.. Cause last three indices created have around 1 million logs each. But older ones have from 1000 to 3000 logs


(Carolina Baldino) #4

Ok, So I been trying stuff today..

I deleted all the indices and let them re-generate again, to see how they populate, and got the same results as before.

Things changed when I modified the following fields:

...
    close_inactive: 2h
    scan_frequency: 5s
...

The old indices started to populate and I got about 1000 more documents for each index. Which is still a low amount of logs considering that they must be about 3 million logs per index.

So this is getting me very confused, and started to think how really the harvesters works.

When I load filebeat, I realized that it starts one harvester per container, cause it's one harvester per log file.
But since I have log rotation in one container.. how does it handle the rotation?
For example, let's say I have one container in my server, so the harvester starts for:

/var/lib/docker/containers/<container_1>/<container_1>-json.log

How does it handles the rotation I have for container_1, which has for example the following rotation:

<container_1>-json.log
<container_1>-json.log.1
<container_1>-json.log.2

Since the harvester starts for *-json.log and not for *-json.log.1, *-json.log.2 ?

Could this be failing?


(Carolina Baldino) #5

Hi.. I changed some things but I still have the same problem..

  • First we changed the log level from the application side (now we are only printing INFO logs instead of INFO + DEBUG logs).

  • I changed the way filebeat ingest logs and now I am using autodiscover feature, so now I am capturing the logs from the containers I am interested in. Here is my filebeat.yml file:

filebeat.autodiscover:
  providers:
    - type: docker
      templates:
        - condition.contains:
            docker.container.image: <image_name>
          config:
            - type: log
              paths:
                - "/var/lib/docker/containers/${data.docker.container.id}/*.log"
              json.keys_under_root: true
              json.add_error_key: true
              json.message_key: log   
              close_inactive: 15m

So now the amount of logs is significantly lower than before: From 2 to 3 millions documents per index to (as the max number I got from Kibana shows me) less than 150.000.

But I still having the same issue. Filebeat, (or Logstash or Elasticsearch) stops loading logs for some reason until I get the filebeat service restarted or changed the close-inactive frequency from 10 to 15 to 20.

I sill don't trust if it is loading all the logs. To make an example, I deleted yesterday's index that had about 20.000 logs (which is really weird since today's index has 140.000 logs..) and when I let it create again, it only loaded 14.000 logs.

What could be happening ? Is there any recommendation for autodiscover?