Filebeat uses insane memory

I've noticed the same problem as follows:

Filebeat Version: 5.0.0
My data flow is: Filebeat (5 machines) -> Elasticsearch (2 machines, where CPU becomes the bottleneck)

My steps to repro:

  1. Kill Filebeat first
  2. Remove the index: DELETE /index_name
  3. Remove the old log files; my program will create new files immediately
  4. Start Filebeat with filebeat.publish_async: true to get the initial state of the files

    max_procs: 4
    filebeat.publish_async: true
    filebeat.spool_size: 2048
    processors:
    - include_fields:
        fields: ["message", "beat.name"]
    output.elasticsearch:
      # Array of hosts to connect to.
      hosts: ["host1:9200", "host2:9200"]
      worker: 8
      loadbalance: true
      pipeline: "pipeline_name"
      index: "index_name_%{+YYYY.MM.dd}"
      bulk_max_size: 256

Hope this helps to find the cause of the problem.

  • Could you share some log files?
  • Is the output able to keep up with the messages you are sending?
  • If the output is not keeping up, the reason for the growing memory is probably publish_async, as the number of items it needs to keep in memory grows over time. Be aware that this is an experimental feature.
  • Please also try to update to the most recent version of Filebeat, if possible, for testing.

The output (Elasticsearch) uses about 12 of 24 CPU cores. Although the load on Elasticsearch is high, it should still be keeping up. I will update the Filebeat version later.

Some log examples:
[fc537c593cdc0fdd0a8addfdd264f6e2] [18/Mar/2017:12:16:01 +0800] [http] [117.164.248.42:33028] [200] [10.47.101.34:80] [200] [5585] [POST /arrow/rank?roomId=100586465&uid=108067592&eid=4&nick=6666666666&protocolVer=3&checkout=%221%22 HTTP/1.1] [250] [472] [arrowio.cmcm.com] [-] [-] [0.002] [0.002] [0.000] [0.002] [0] [16041996952] [1][-] [-] [-] [-]
[ca4fd3602151ee0f80a2eb229bd92848] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43303] [200] [10.104.13.124:80] [200] [11001] [POST /game/readvip HTTP/1.1] [395] [379] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.004] [0.004] [0.000] [0.004] [0] [16041996941] [1][-] [-] [-] [-]
[da503cc5313a8ef62cb9e2a8b3c8bb22] [18/Mar/2017:12:16:01 +0800] [http] [163.204.17.117:12402] [200] [10.135.62.160:80] [200] [6690] [POST /qzone/m/dungeon/invateList/ HTTP/1.1] [995] [1333] [qzone-mobile.boomegg.cn] [Mozilla/5.0 (Linux; Android 5.1.1; 2014812 Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043015 Safari/537.36 V1_AND_SQ_6.6.9_482_YYB_D QQ/6.6.9.3060 NetType/WIFI WebP/0.3.0 Pixel/720] [http://1105764175.urlshare.cn/home?_proxy=1&_wv=2147624743&via=&uid=404817351&mType=Group] [0.013] [0.013] [0.000] [0.013] [28500] [16041965496] [4][-] [-] [-] [-]
[306cd2474c1ea70244704f0fe89d127b] [18/Mar/2017:12:16:01 +0800] [http] [36.4.62.52:4255] [200] [10.104.50.33:80] [200] [11001] [POST /task/daily HTTP/1.1] [281] [401] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.4.4; N5207 Build/KTU84P)] [-] [0.007] [0.007] [0.002] [0.007] [0] [16041996939] [1][-] [-] [-] [-]
[e2fe971cb71d9d1349355867166aa59d] [18/Mar/2017:12:16:01 +0800] [http] [112.20.218.228:2298] [200] [10.47.101.161:9001] [200] [6209] [POST /ping HTTP/1.1] [195] [254] [arrowiotxnorth3.gz.1252921383.clb.myqcloud.com:9001] [-] [-] [0.001] [0.001] [0.000] [0.001] [0] [16041996966] [1][-] [-] [-] [-]
[5b12614c0c20d9066b7293f9b286d7d5] [18/Mar/2017:12:16:01 +0800] [http] [171.118.47.49:52295] [200] [10.104.242.226:80] [200] [3048] [POST /zgame/?m=snake&a=snake_require HTTP/1.1] [619] [344] [snake-pc-norm-fixip.gz.1252595457.clb.myqcloud.com] [UnityPlayer/5.5.0f3 (http://unity3d.com)] [-] [0.080] [0.039] [0.004] [0.039] [0] [16041996724] [1][-] [-] [-] [-]
[ffda99f074cf9982f9b81f15d8cbfd37] [18/Mar/2017:12:16:01 +0800] [http] [123.122.194.156:34821] [200] [10.6.132.147:80] [200] [4762] [GET /touch/ajax/common/addCollectedtimes.xhtml?movieId=323881648 HTTP/1.1] [1325] [604] [m.gewara.com] [Mozilla/5.0 (Linux; Android 6.0; Letv X500 Build/DBXCNOP5902012161S; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.91 Mobile Safari/537.36 GewaraMovie/8.1.1 appkey/android2009 appSource/AS133 apptype/cinema osType/ANDROID from/appandroid2009 deviceId/020000000000 imei/868256025618413 citycode/110000] [http://m.gewara.com/movie/m/detail.xhtml?movieId=323881648] [0.002] [0.002] [0.000] [0.002] [0] [16041996956] [1][-] [-] [-] [-]
[4503f094f7cf775393452a460f2e7efa] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43305] [200] [10.104.21.147:80] [200] [11001] [POST /vip/read HTTP/1.1] [450] [378] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.005] [0.005] [0.000] [0.005] [0] [16041996947] [1][-] [-] [-] [-]
[4f118574e5e5d0275221caccae729f11] [18/Mar/2017:12:16:01 +0800] [http] [112.38.127.253:3563] [200] [10.135.30.6:80] [200] [5272] [POST /protocol?uid=14842711&_t=1489810555&_v=1.4 HTTP/1.0] [407] [343] [clover.happyelements.cn] [HeClover] [-] [0.062] [0.011] [0.003] [0.011] [48000] [16041996768] [1][-] [-] [-] [-]
[e4d701f2795a9926b8c422c52cc73af5] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43307] [200] [10.104.18.156:80] [200] [11001] [POST /backup/readBasicData HTTP/1.1] [462] [381] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.007] [0.007] [0.000] [0.007] [0] [16041996951] [1][-] [-] [-] [-]

Sorry, I was not clear enough. I meant the logs from Filebeat.

Hi ruflin,

Here is a gist link for the log:

At the end of the log file, the machine runs out of memory and restarts.

As you can see in the log file, there are some issues when sending data to Elasticsearch. I strongly recommend you turn off publish_async, as this will be the cause of the "insane memory" usage. What is the reason you turned it on in the first place? Were events/s not high enough without it?

I would expect you to see some more details about the errors in the Elasticsearch logs.

Yes, without it the events/s are not high enough for one of our requirements.

Elasticsearch log:

  • How many events per second do you have? Did you test if Filebeat is the limiting factor?
  • What kind of pipeline are you using in Elasticsearch?

How many events per second do you have? Did you test if Filebeat is the limiting factor?
30k events per second per machine for one of our requirements, and I don't think Filebeat is the limiting factor: Filebeat uses about 1.5 cores to process those events. By the way, Filebeat 5.0.0 can process about 20k events per second per core, while Filebeat 5.2.2 can only do 10k events per second per core.

What kind of pipeline are you using in Elasticsearch?
A grok processor to extract fields from the log events.
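
(The actual pipeline definition is not included here; as a rough illustration only, a minimal sketch of this kind of ingest pipeline might look like the following, grok-ing just the leading bracketed fields of the access-log lines above. The pattern and field names are hypothetical.)

    PUT _ingest/pipeline/pipeline_name
    {
      "description": "Sketch only: extract the leading bracketed fields from the access log",
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": [
              "\\[%{DATA:request_id}\\] \\[%{HTTPDATE:timestamp}\\] \\[%{WORD:scheme}\\] \\[%{IPORHOST:client_ip}:%{POSINT:client_port}\\] \\[%{NUMBER:status}\\] %{GREEDYDATA:rest}"
            ]
          }
        }
      ]
    }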

Let's discuss point by point:

  • If Filebeat is not the limiting factor, you should not need publish_async, so the memory issue should disappear. Can you disable publish_async and confirm that?
  • Grok processor: A grok processor can be quite expensive. It seems you hit a limit there. I would recommend opening a discuss topic in the Elasticsearch forum (with a link to this thread and the above error) and checking what you can do to get rid of it.
  • 5.0 to 5.2.2: That would be a regression. So the exact same setup leads to fewer events/s in 5.2.2 than in 5.0? Could you run this comparison with the file output to take the output performance out of the equation?
  • If Filebeat is not the limiting factor, you should not need publish_async, so the memory issue should disappear. Can you disable publish_async and confirm that?

Sorry, I was not clear enough. I meant that Filebeat does not run out of CPU resources during testing. In the production environment, we would like Filebeat to use less CPU. After I turned off publish_async, the memory issue did indeed disappear.

  • Grok processor: A grok processor can be quite expensive. It seems you hit a limit there. I would recommend opening a discuss topic in the Elasticsearch forum (with a link to this thread and the above error) and checking what you can do to get rid of it.

Ok, I will do it.

  • 5.0 to 5.2.2: That would be a regression. So the exact same setup leads to fewer events/s in 5.2.2 than in 5.0? Could you run this comparison with the file output to take the output performance out of the equation?

Yes, the exact same setup; I just replaced the Filebeat binary.
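
For the file-output comparison, I could run something like this with output.elasticsearch disabled, so the output performance is out of the equation (the path and filename below are placeholders, not my real config):

    output.file:
      path: "/tmp/filebeat-bench"   # placeholder directory
      filename: "filebeat-events"   # placeholder file name
      rotate_every_kb: 102400
      number_of_files: 7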

I can't reproduce the insane memory problem with Filebeat 5.2.2.

And you still have publish_async enabled?

Sorry, I reproduced this problem again with publish_async enabled, though the probability is very low.

I also double-checked this, and I hope this problem can be solved.

Sorry, I reproduced this problem again with publish_async enabled, though the probability is very low.

publish_async is deprecated. It was experimental and led to too many problems. It's not a regression, as the bad behavior can occur with any release. You're highly encouraged to disable it. Instead, enable pipelining: true in the Logstash output and have spool_size > bulk_max_size.

OK. By the way, Filebeat writes directly to Elasticsearch in my environment.

Oh, I see. Well, the Elasticsearch output doesn't support pipelining. The best you can do is use multiple workers and have spool_size > bulk_max_size. With N = spool_size / bulk_max_size, the bigger N is, the more sub-batches will be processed concurrently (at the cost of increased memory usage).
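
With the numbers already in this thread, that would look roughly like this (a sketch, not a tuned recommendation):

    filebeat.publish_async: false     # default; leaves the experimental async publisher off
    filebeat.spool_size: 2048         # N = 2048 / 256 = 8 sub-batches in flight
    output.elasticsearch:
      hosts: ["host1:9200", "host2:9200"]
      worker: 8
      loadbalance: true
      bulk_max_size: 256              # spool_size / bulk_max_size sub-batches per spooler flush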