Elastic receives a small amount of data and nginx buffers the bulk events

Hi List,
Hi List,
Since we switched from fluentd to fluent-bit, our Elasticsearch cluster has shown more problems with the bulk queue. Is there a way to see what's in the bulk queue and why it isn't progressing?
nginx logs this message many times:

a client request body is buffered to a temporary file
request: "POST /_bulk HTTP/1.1", host: "elasticxxxnl"

The 5 nodes in the cluster are really low on CPU usage.

Which version of Elasticsearch are you using?

What is the specification of the nodes in the cluster with respect to RAM, CPU and type of storage used?

Do you have monitoring installed?

Unfortunately I have no experience with fluentd or fluent-bit so can not help with troubleshooting changes or issues there.

This just means that the request body was larger than the size that Nginx was willing to hold in-memory so it spilled it to a file. By default I think this happens whenever the request is larger than 8kB; bulk requests should mostly be >8kB in size so I think this is the expected behaviour. I don't think it's a big deal, but there are probably config options in Nginx to prevent it if needed. I also don't think this has anything to do with Elasticsearch, you'd be better off asking about it on a more Nginx-focussed forum.

(I doubt requests like this hit the disk, the data probably only gets as far as the pagecache, so it's still technically in-memory either way)
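If you do want to avoid the temp-file spill, raising the in-memory body buffer is the relevant knob. An untested sketch (`client_body_buffer_size` is the real directive; the sizes here are just illustrative, pick values matching your actual bulk sizes):

```
http {
    # keep request bodies up to this size in memory instead of a temp file
    # (platform default is 8k or 16k)
    client_body_buffer_size 16m;

    # must stay >= your largest bulk request or nginx returns 413
    client_max_body_size    100m;
}
```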


Thanks David, this sounds quite logical.
Strangely, I'm missing 40%-60% of the log entries generated by applications in Docker on the Kubernetes platforms.
fluent-bit reports many warnings like these:

[2020/12/29 02:51:54] [ warn] [engine] failed to flush chunk '1-1608891285.347503559.flb', retry in 867 seconds: task_id=796, input=tail.0 > output=es.0
[2020/12/29 02:51:55] [ warn] [engine] failed to flush chunk '1-1608894153.539621536.flb', retry in 1110 seconds: task_id=1422, input=tail.0 > output=es.0
[2020/12/29 02:51:55] [ warn] [engine] failed to flush chunk '1-1608895587.328538075.flb', retry in 830 seconds: task_id=1759, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608887565.333424683.flb', retry in 1366 seconds: task_id=35, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608891815.333089060.flb', retry in 1047 seconds: task_id=949, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608888441.890175576.flb', retry in 108 seconds: task_id=253, input=tail.0 > output=es.0
[2020/12/29 02:51:57] [ warn] [engine] failed to flush chunk '1-1608892265.334450560.flb', retry in 727 seconds: task_id=1051, input=tail.0 > output=es.0
[2020/12/29 02:51:57] [ warn] [engine] failed to flush chunk '1-1608889041.727434703.flb', retry in 532 seconds: task_id=388, input=tail.0 > output=es.0

I will dig into fluent-bit, as the error is probably there, or between fluent-bit and nginx in the firewalls.

Attached is the CPU/memory monitoring of the cluster.

It looks like you may have a lot of small shards in your cluster, which is inefficient and can cause serious problems. I would recommend reducing this significantly given the size and resources available to your cluster.

Yes, those warnings don't have any useful details telling you why it failed or what to do about it. It's possible that the reason is within Elasticsearch of course, but if so Elasticsearch will be returning much more detailed errors describing the problems. You'll need to seek some fluent-bit expertise to see if you can get more details here. I don't see many conversations about it on these forums so you're probably best off trying elsewhere.

If you get hold of errors coming from Elasticsearch and need help understanding them then please ask again here.
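On your original question about seeing the bulk queue: the thread pool stats expose the queue depth and rejection counts per node. Something like this (the pool is called `write` in recent versions, `bulk` in older ones; the column list is just a selection):

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
```

A growing `queue` or nonzero `rejected` there would explain retries on the shipper side.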

Also as Christian says your shard count seems very high, you'd do well to reduce it.

Thanks for your answers. It took some time after the holidays (happy new year, everybody) to dive into the fluent-bit errors.

With fluent-bit in trace mode we now see Elasticsearch errors when a mapping is wrong. Strangely, when a 5 MB bulk of 1000 events is sent from fluent-bit and one event has a wrong mapping, all events are rejected by Elasticsearch.
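For reference, the `_bulk` response from Elasticsearch reports a status per item, so only the events with the bad mapping should fail at the Elasticsearch level; a whole-chunk retry would be the shipper resending everything. A small sketch of pulling the failed items out of a response (the sample response below is made up):

```python
# Hypothetical example of an Elasticsearch _bulk response: each item carries
# its own status, so a mapping error only marks the offending events.
bulk_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "logs", "status": 201}},
        {"index": {"_index": "logs", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [level]"}}},
        {"index": {"_index": "logs", "status": 201}},
    ],
}

def failed_items(response):
    """Return (position, error type) pairs for rejected items."""
    failures = []
    for pos, item in enumerate(response["items"]):
        # the action key varies: "index", "create", "update", ...
        result = next(iter(item.values()))
        if "error" in result:
            failures.append((pos, result["error"]["type"]))
    return failures

print(failed_items(bulk_response))  # -> [(1, 'mapper_parsing_exception')]
```

Capturing one of these responses in trace mode should show whether Elasticsearch really rejected all 1000 events or just the mismatched ones.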

Sure, the layout is different for all the logs from containers in pods in a namespace, but creating an index for each container would create a lot of indices. Does anybody have experience in a k8s environment with pods and containers?