Filebeat uses insane memory

I've noticed the same problem as follows:

Filebeat Version: 5.0.0
My data flow is: Filebeat (5 machines) -> Elasticsearch (2 machines, where CPU becomes the bottleneck)

My steps to repro:

  1. Kill Filebeat first
  2. Remove the index: DELETE /index_name
  3. Remove the old log files; my program will create new files immediately
  4. Start Filebeat with filebeat.publish_async: true to get the initial state of the files

    max_procs: 4
    filebeat.publish_async: true
    filebeat.spool_size: 2048
    processors:
    - include_fields:
        fields: ["message", "beat.name"]
    output.elasticsearch:
      # Array of hosts to connect to.
      hosts: ["host1:9200", "host2:9200"]
      worker: 8
      loadbalance: true
      pipeline: "pipeline_name"
      index: "index_name_%{+YYYY.MM.dd}"
      bulk_max_size: 256

Hope this helps to find the cause of the problem.

  • Could you share some log files?
  • Is the output able to keep up with the messages you are sending?
  • If the output is not keeping up, the reason for the growing memory is probably publish_async, as the number of items it needs to keep in memory grows over time. Be aware that this is an experimental feature.
  • Please also try to update to the most recent version of Filebeat, if possible, for testing.

The output (Elasticsearch) uses about 12 of 24 CPU cores. Although the load on Elasticsearch is high, it should still be keeping up. I will update the Filebeat version later.

Some log examples:
[fc537c593cdc0fdd0a8addfdd264f6e2] [18/Mar/2017:12:16:01 +0800] [http] [117.164.248.42:33028] [200] [10.47.101.34:80] [200] [5585] [POST /arrow/rank?roomId=100586465&uid=108067592&eid=4&nick=6666666666&protocolVer=3&checkout=%221%22 HTTP/1.1] [250] [472] [arrowio.cmcm.com] [-] [-] [0.002] [0.002] [0.000] [0.002] [0] [16041996952] [1][-] [-] [-] [-]
[ca4fd3602151ee0f80a2eb229bd92848] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43303] [200] [10.104.13.124:80] [200] [11001] [POST /game/readvip HTTP/1.1] [395] [379] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.004] [0.004] [0.000] [0.004] [0] [16041996941] [1][-] [-] [-] [-]
[da503cc5313a8ef62cb9e2a8b3c8bb22] [18/Mar/2017:12:16:01 +0800] [http] [163.204.17.117:12402] [200] [10.135.62.160:80] [200] [6690] [POST /qzone/m/dungeon/invateList/ HTTP/1.1] [995] [1333] [qzone-mobile.boomegg.cn] [Mozilla/5.0 (Linux; Android 5.1.1; 2014812 Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043015 Safari/537.36 V1_AND_SQ_6.6.9_482_YYB_D QQ/6.6.9.3060 NetType/WIFI WebP/0.3.0 Pixel/720] [http://1105764175.urlshare.cn/home?_proxy=1&_wv=2147624743&via=&uid=404817351&mType=Group] [0.013] [0.013] [0.000] [0.013] [28500] [16041965496] [4][-] [-] [-] [-]
[306cd2474c1ea70244704f0fe89d127b] [18/Mar/2017:12:16:01 +0800] [http] [36.4.62.52:4255] [200] [10.104.50.33:80] [200] [11001] [POST /task/daily HTTP/1.1] [281] [401] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.4.4; N5207 Build/KTU84P)] [-] [0.007] [0.007] [0.002] [0.007] [0] [16041996939] [1][-] [-] [-] [-]
[e2fe971cb71d9d1349355867166aa59d] [18/Mar/2017:12:16:01 +0800] [http] [112.20.218.228:2298] [200] [10.47.101.161:9001] [200] [6209] [POST /ping HTTP/1.1] [195] [254] [arrowiotxnorth3.gz.1252921383.clb.myqcloud.com:9001] [-] [-] [0.001] [0.001] [0.000] [0.001] [0] [16041996966] [1][-] [-] [-] [-]
[5b12614c0c20d9066b7293f9b286d7d5] [18/Mar/2017:12:16:01 +0800] [http] [171.118.47.49:52295] [200] [10.104.242.226:80] [200] [3048] [POST /zgame/?m=snake&a=snake_require HTTP/1.1] [619] [344] [snake-pc-norm-fixip.gz.1252595457.clb.myqcloud.com] [UnityPlayer/5.5.0f3 (http://unity3d.com)] [-] [0.080] [0.039] [0.004] [0.039] [0] [16041996724] [1][-] [-] [-] [-]
[ffda99f074cf9982f9b81f15d8cbfd37] [18/Mar/2017:12:16:01 +0800] [http] [123.122.194.156:34821] [200] [10.6.132.147:80] [200] [4762] [GET /touch/ajax/common/addCollectedtimes.xhtml?movieId=323881648 HTTP/1.1] [1325] [604] [m.gewara.com] [Mozilla/5.0 (Linux; Android 6.0; Letv X500 Build/DBXCNOP5902012161S; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.91 Mobile Safari/537.36 GewaraMovie/8.1.1 appkey/android2009 appSource/AS133 apptype/cinema osType/ANDROID from/appandroid2009 deviceId/020000000000 imei/868256025618413 citycode/110000] [http://m.gewara.com/movie/m/detail.xhtml?movieId=323881648] [0.002] [0.002] [0.000] [0.002] [0] [16041996956] [1][-] [-] [-] [-]
[4503f094f7cf775393452a460f2e7efa] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43305] [200] [10.104.21.147:80] [200] [11001] [POST /vip/read HTTP/1.1] [450] [378] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.005] [0.005] [0.000] [0.005] [0] [16041996947] [1][-] [-] [-] [-]
[4f118574e5e5d0275221caccae729f11] [18/Mar/2017:12:16:01 +0800] [http] [112.38.127.253:3563] [200] [10.135.30.6:80] [200] [5272] [POST /protocol?uid=14842711&_t=1489810555&_v=1.4 HTTP/1.0] [407] [343] [clover.happyelements.cn] [HeClover] [-] [0.062] [0.011] [0.003] [0.011] [48000] [16041996768] [1][-] [-] [-] [-]
[e4d701f2795a9926b8c422c52cc73af5] [18/Mar/2017:12:16:01 +0800] [http] [112.28.163.23:43307] [200] [10.104.18.156:80] [200] [11001] [POST /backup/readBasicData HTTP/1.1] [462] [381] [game.subway.uu.cc] [Dalvik/1.6.0 (Linux; U; Android 4.2.2; vivo Y17T Build/JDQ39)] [-] [0.007] [0.007] [0.000] [0.007] [0] [16041996951] [1][-] [-] [-] [-]

Sorry, I was not clear enough. I meant the logs from Filebeat.

Hi ruflin,

Here is a gist link for the log:

At the end of the log file, the machine runs out of memory and restarts.

As you can see in the log file, there are some issues when sending data to Elasticsearch. I strongly recommend you turn off publish_async, as this will be the cause of the "insane memory" usage. What is the reason you turned it on in the first place? Were events/s not high enough without it?

I would expect you to see some more details about the errors in the Elasticsearch logs.

Yes, without it the events/s are not high enough for one of our requirements.

Elasticsearch log:

  • How many events per second do you have? Did you test if Filebeat is the limiting factor?
  • What kind of pipeline are you using in Elasticsearch?

How many events per second do you have? Did you test if Filebeat is the limiting factor?
30k events per second per machine for one of our requirements, and I don't think Filebeat is the limiting factor: Filebeat uses about 1.5 cores to process those events. By the way, Filebeat 5.0.0 can process about 20k events per second per core, while Filebeat 5.2.2 can only do 10k events per second per core.

What kind of pipeline are you using in Elasticsearch?
A grok processor to extract fields from the log events.
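
(The actual pipeline definition is not included here; as a rough illustration only, a minimal sketch of this kind of ingest pipeline might look like the following, grok-ing just the leading bracketed fields of the access-log lines above. The pattern and field names are hypothetical.)

    PUT _ingest/pipeline/pipeline_name
    {
      "description": "Sketch only: extract the leading bracketed fields from the access log",
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": [
              "\\[%{DATA:request_id}\\] \\[%{HTTPDATE:timestamp}\\] \\[%{WORD:scheme}\\] \\[%{IPORHOST:client_ip}:%{POSINT:client_port}\\] \\[%{NUMBER:status}\\] %{GREEDYDATA:rest}"
            ]
          }
        }
      ]
    }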

Let's discuss point by point:

  • If Filebeat is not the limiting factor, you should not need publish_async, so the memory issue should disappear. Can you disable publish_async and confirm that?
  • Grok processor: A grok processor can be quite expensive. It seems you hit a limit there. I would recommend opening a discuss topic in the Elasticsearch forum (with a link to this thread and the above error) and checking what you can do to get rid of it.
  • 5.0 to 5.2.2: That would be a regression. So the exact same setup leads to fewer events/s in 5.2.2 than in 5.0? Could you run this comparison with the file output to take the output performance out of the equation?
  • If Filebeat is not the limiting factor, you should not need publish_async, so the memory issue should disappear. Can you disable publish_async and confirm that?

Sorry, I was not clear enough. I meant that Filebeat does not run out of CPU resources during testing. In the production environment, we would like Filebeat to use less CPU. After I turned off publish_async, the memory issue did indeed disappear.

  • Grok processor: A grok processor can be quite expensive. It seems you hit a limit there. I would recommend opening a discuss topic in the Elasticsearch forum (with a link to this thread and the above error) and checking what you can do to get rid of it.

Ok, I will do it.

  • 5.0 to 5.2.2: That would be a regression. So the exact same setup leads to fewer events/s in 5.2.2 than in 5.0? Could you run this comparison with the file output to take the output performance out of the equation?

Yes, the exact same setup; I just replaced the Filebeat binary.
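
For the file-output comparison, I could run something like this with output.elasticsearch disabled, so the output performance is out of the equation (the path and filename below are placeholders, not my real config):

    output.file:
      path: "/tmp/filebeat-bench"   # placeholder directory
      filename: "filebeat-events"   # placeholder file name
      rotate_every_kb: 102400
      number_of_files: 7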

I can't reproduce the insane memory problem with Filebeat 5.2.2.

And you still have publish_async enabled?

Sorry, I reproduced this problem again with publish_async enabled, though the probability is very low.

I also double-checked this, and I hope this problem can be solved.

Sorry, I reproduced this problem again with publish_async enabled, though the probability is very low.

publish_async is deprecated. It was experimental and led to too many problems. It's not a regression, as the bad behavior can occur with any release. You're highly encouraged to disable it. Instead, enable pipelining: true in the Logstash output and have spool_size > bulk_max_size.

OK. By the way, Filebeat writes directly to Elasticsearch in my environment.

Oh, I see. Well, the Elasticsearch output doesn't support pipelining. The best you can do is use multiple workers and have spool_size > bulk_max_size. With N = spool_size / bulk_max_size, the bigger N is, the more sub-batches will be processed concurrently (at the cost of increased memory usage).
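
With the numbers already in this thread, that would look roughly like this (a sketch, not a tuned recommendation):

    filebeat.publish_async: false     # default; leaves the experimental async publisher off
    filebeat.spool_size: 2048         # N = 2048 / 256 = 8 sub-batches in flight
    output.elasticsearch:
      hosts: ["host1:9200", "host2:9200"]
      worker: 8
      loadbalance: true
      bulk_max_size: 256              # spool_size / bulk_max_size sub-batches per spooler flush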