Elastic receives a small amount of data and nginx buffers the bulk events

Hi List,
Hi List,
Since we switched from fluentd to fluent-bit, our Elasticsearch cluster has shown more problems with the bulk queue. Is there a way to see what's in the bulk queue and why it isn't progressing?
nginx logs this message many times:

a client request body is buffered to a temporary file
request: "POST /_bulk HTTP/1.1", host: "elasticxxxnl"

The 5 nodes in the cluster are really low on CPU usage.

Which version of Elasticsearch are you using?

What is the specification of the nodes in the cluster with respect to RAM, CPU and type of storage used?

Do you have monitoring installed?

Unfortunately I have no experience with fluentd or fluent-bit so can not help with troubleshooting changes or issues there.

This just means that the request body was larger than the size that Nginx was willing to hold in-memory so it spilled it to a file. By default I think this happens whenever the request is larger than 8kB; bulk requests should mostly be >8kB in size so I think this is the expected behaviour. I don't think it's a big deal, but there are probably config options in Nginx to prevent it if needed. I also don't think this has anything to do with Elasticsearch, you'd be better off asking about it on a more Nginx-focussed forum.

(I doubt requests like this hit the disk, the data probably only gets as far as the pagecache, so it's still technically in-memory either way)
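If you do want to avoid the temp-file spill, raising the in-memory body buffer is the relevant knob. An untested sketch (`client_body_buffer_size` is the real directive; the sizes here are just illustrative, pick values matching your actual bulk sizes):

```
http {
    # keep request bodies up to this size in memory instead of a temp file
    # (platform default is 8k or 16k)
    client_body_buffer_size 16m;

    # must stay >= your largest bulk request or nginx returns 413
    client_max_body_size    100m;
}
```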


Thanks David, this sounds quite logical.
Strangely, I'm missing 40%-60% of the log entries generated by applications in Docker on the Kubernetes platforms.
fluent-bit reports many warnings like these:

[2020/12/29 02:51:54] [ warn] [engine] failed to flush chunk '1-1608891285.347503559.flb', retry in 867 seconds: task_id=796, input=tail.0 > output=es.0
[2020/12/29 02:51:55] [ warn] [engine] failed to flush chunk '1-1608894153.539621536.flb', retry in 1110 seconds: task_id=1422, input=tail.0 > output=es.0
[2020/12/29 02:51:55] [ warn] [engine] failed to flush chunk '1-1608895587.328538075.flb', retry in 830 seconds: task_id=1759, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608887565.333424683.flb', retry in 1366 seconds: task_id=35, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608891815.333089060.flb', retry in 1047 seconds: task_id=949, input=tail.0 > output=es.0
[2020/12/29 02:51:56] [ warn] [engine] failed to flush chunk '1-1608888441.890175576.flb', retry in 108 seconds: task_id=253, input=tail.0 > output=es.0
[2020/12/29 02:51:57] [ warn] [engine] failed to flush chunk '1-1608892265.334450560.flb', retry in 727 seconds: task_id=1051, input=tail.0 > output=es.0
[2020/12/29 02:51:57] [ warn] [engine] failed to flush chunk '1-1608889041.727434703.flb', retry in 532 seconds: task_id=388, input=tail.0 > output=es.0

I will dig into fluent-bit, as the error is probably there, or between fluent-bit and nginx in the firewalls.

Attached is the CPU/memory monitoring of the cluster.

It looks like you may have a lot of small shards in your cluster, which is inefficient and can cause serious problems. I would recommend reducing this significantly given the size and resources available to your cluster.

Yes, those warnings don't have any useful details telling you why it failed or what to do about it. It's possible that the reason is within Elasticsearch of course, but if so Elasticsearch will be returning much more detailed errors describing the problems. You'll need to seek some fluent-bit expertise to see if you can get more details here. I don't see many conversations about it on these forums so you're probably best off trying elsewhere.

If you get hold of errors coming from Elasticsearch and need help understanding them then please ask again here.
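On your original question about seeing the bulk queue: the thread pool stats expose the queue depth and rejection counts per node. Something like this (the pool is called `write` in recent versions, `bulk` in older ones; the column list is just a selection):

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
```

A growing `queue` or nonzero `rejected` there would explain retries on the shipper side.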

Also as Christian says your shard count seems very high, you'd do well to reduce it.

Thanks for your answers. It took some time after the holidays (happy new year, everybody) to dive into the fluent-bit errors.

With fluent-bit in trace mode we now see Elasticsearch errors when a mapping is wrong. Strangely, when a 5 MB bulk of 1000 events is sent from fluent-bit and one event has a wrong mapping, all events are rejected by Elasticsearch.
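For reference, the `_bulk` response from Elasticsearch reports a status per item, so only the events with the bad mapping should fail at the Elasticsearch level; a whole-chunk retry would be the shipper resending everything. A small sketch of pulling the failed items out of a response (the sample response below is made up):

```python
# Hypothetical example of an Elasticsearch _bulk response: each item carries
# its own status, so a mapping error only marks the offending events.
bulk_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "logs", "status": 201}},
        {"index": {"_index": "logs", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [level]"}}},
        {"index": {"_index": "logs", "status": 201}},
    ],
}

def failed_items(response):
    """Return (position, error type) pairs for rejected items."""
    failures = []
    for pos, item in enumerate(response["items"]):
        # the action key varies: "index", "create", "update", ...
        result = next(iter(item.values()))
        if "error" in result:
            failures.append((pos, result["error"]["type"]))
    return failures

print(failed_items(bulk_response))  # -> [(1, 'mapper_parsing_exception')]
```

Capturing one of these responses in trace mode should show whether Elasticsearch really rejected all 1000 events or just the mismatched ones.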

Sure, the layout is different for all the logs from containers in pods in a namespace, but creating an index for each container would create a lot of indices. Does anybody have experience in a k8s environment with pods and containers?