Very slow Filebeat/Elastic Cloud throughput

Here's my cluster, hosted on elastic.co:

My index activity:

My filebeat monitoring graphs:

And my filebeat console output stats:

2018-08-03T07:13:28.751Z INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":300,"time":302},"total":{"ticks":1520,"time":1525,"value":1520},"user":{"ticks":1220,"time":1223}},"info":{"ephemeral_id":"4dc13585-4910-4d25-822d-94645167e2d5","uptime":{"ms":600010}},"memstats":{"gc_next":35961888,"memory_alloc":25477304,"memory_total":141147488}},"filebeat":{"events":{"active":2,"added":202,"done":200},"harvester":{"open_files":10,"running":10,"started":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":200,"batches":4,"total":200},"read":{"bytes":3459},"write":{"bytes":128119}},"pipeline":{"clients":1,"events":{"active":4117,"filtered":1,"published":200,"total":201},"queue":{"acked":200}}},"registrar":{"states":{"current":2,"update":200},"writes":4},"system":{"load":{"1":3.22,"15":1.6,"5":1.99,"norm":{"1":0.805,"15":0.4,"5":0.4975}}},"xpack":{"monitoring":{"pipeline":{"events":{"published":3,"total":3},"queue":{"acked":3}}}}}}}

As you can see there, it seems to be running/processing 10 files (there's one per minute). It gets way behind on events.
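
The throughput for that window can be read straight off the monitoring line. A minimal Python sketch, using only numbers copied from the log above; the variable names are just for illustration, and the per-batch timing assumes the batches were sent one after another by a single output worker:

```python
# Values copied from the "Non-zero metrics in the last 30s" line above.
window_seconds = 30
output_events = {"acked": 200, "batches": 4, "total": 200}

# Events acknowledged by Elasticsearch per second over the window.
events_per_second = output_events["acked"] / window_seconds

# Average number of events per bulk request.
events_per_batch = output_events["acked"] / output_events["batches"]

# Average wall-clock time per bulk request. Assumption: batches were sent
# sequentially (a single output worker), which the log does not confirm.
seconds_per_batch = window_seconds / output_events["batches"]

print(f"{events_per_second:.2f} events/s")
print(f"{events_per_batch:.0f} events per batch")
print(f"{seconds_per_batch:.1f} s per batch")
```

For this sample that is roughly 7 events per second, in bulk requests of 50 events each.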

There doesn't seem to be any substantial load on the system at all.

Can anyone advise? Any more info I can give to help diagnose?

Bonus question: what is the "system load" stat - it's presumably not CPU core time, since the Pod in kubernetes that filebeat shares with the actual workload only peaks at ~1.13 cores?

So it looks like the default filebeat config with "cloud.id" and "cloud.auth" (no explicit output.elasticsearch section) yields terrible performance, about 10 events/sec. That's as per the sample config I found.

What's an actually-good config for this?

Could you please share your configuration formatted using </>?

System load is the load average of your host. system.load.n is the load average over the last n minutes. As it's a quad-core host, to get the load per core you need to look at the system.load.norm.n metrics. So over the last 1 minute the load was 0.805 per core, over the last 5 minutes 0.4975, etc.

{"system":
    {"load":
        {"1": 3.22,
         "15": 1.6,
         "5": 1.99,
         "norm":
            {"1": 0.805,
             "15": 0.4,
             "5": 0.4975}}}}
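
The normalized values are simply the load average divided by the number of cores. A quick Python sketch with the numbers above; the core count of 4 is an assumption inferred from the ratio 3.22 / 0.805, not something the log states directly:

```python
# Load averages from the Filebeat monitoring output (1, 5 and 15 minutes).
load = {"1": 3.22, "5": 1.99, "15": 1.6}

# Assumption: 4 cores, inferred from load["1"] / norm["1"] = 3.22 / 0.805.
cores = 4

# Per-core load, which is what system.load.norm.n reports.
norm = {minutes: value / cores for minutes, value in load.items()}

for minutes, value in norm.items():
    print(f"{minutes} min: {value:.4f} per core")
```

A normalized value below 1 means that, on average, there was less runnable work than there were cores.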

You have a small Elastic Cloud cluster, so it is important to use the available resources as efficiently as possible. I would recommend the following:

  • You have far too many shards for a cluster that size. Change to a single primary shard per index and also consider having each index cover a longer time period, e.g. a month, to get the average shard size up and the shard count down. Read this blog post for more details. Reducing the number of shards you are actively indexing into can also help reduce the number of bulk rejections you might be seeing, which can improve performance as Beats will need to retry less.

  • Having lots of Beats with low volumes write directly into Elasticsearch can result in very small bulk requests, which can be inefficient. Instead, try increasing the batch size Beats write with in order to improve performance. This blog post provides a good discussion and example. Larger batches can mean documents take longer to reach Elasticsearch. If this is not desirable, you can send Beats data through Logstash, which will allow you to tune the batch size across all Beats.

  • It also looks like you have multiple versions of Filebeat writing to separate version-specific indices. Update all Filebeats to the same version to get fewer shards to index into.
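
To get a feel for how much the first and third suggestions shrink the shard count, here is a rough Python sketch. All inputs are illustrative assumptions, not figures taken from this cluster: two Filebeat versions, one index per day kept for 30 days, and 5 primary shards per index (the Elasticsearch default before 7.0; the real value depends on the index template in use). The helper `primary_shards` is hypothetical.

```python
def primary_shards(versions, indices_per_version, shards_per_index):
    """Total primary shards held for a given indexing layout."""
    return versions * indices_per_version * shards_per_index

# Hypothetical current layout: 2 Filebeat versions, one index per day
# for 30 days, 5 primary shards per index (assumed, see above).
before = primary_shards(versions=2, indices_per_version=30, shards_per_index=5)

# Suggested layout: one Filebeat version, one monthly index, 1 primary shard.
after = primary_shards(versions=1, indices_per_version=1, shards_per_index=1)

print(f"before: {before} primary shards, after: {after}")
```

Under those assumptions, a month of data goes from 300 primary shards to 1.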


Thanks Christian, very much appreciated and I'll tweak our setup.

I just went with the defaults and configs that I found on the official site. Is there some config/setup guide I should have stumbled across? It would be a shame if my situation (running into terrible performance problems and posting here) was the official route!

Sorry to vent, but cloud.elastic.co has just been such a pain. The cluster isn't showing any data logged. CPU use is low, mem use low. Forced a restart minutes ago. Support don't respond for hours, by which time it's fixed itself. Getting set up was a pain, default configurations are a pain. I don't know who it's supposed to be aimed at - I guess not me?

If anyone can help... I've been waiting for a Force Restart on my single-node deployment for around an hour..

3 hours now...
