Metricbeat cluster sizing

Interesting... your diskio document counts are very high; there must be many, many disks on each host, as they are about 3-5x what I would expect.

But the other numbers seem OK, I guess, though it still seems high for 1000 Metricbeat hosts.

So here are some of the things I would do; others may have other opinions.

You can try this on your single node.

You already have a lot of segments (the underlying Lucene data structures). You are not force merging when you roll over, so segments are beginning to add up. Lots of segments = slow queries, and you already have 322 segments across only 22 shards.

In the ILM policy, set force merge to 1 segment on rollover.
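A minimal sketch of what that could look like (assuming a version where the forcemerge action is allowed in the hot phase after rollover; the policy name and rollover thresholds here are placeholders, not your actual values):

# placeholder policy; adjust the rollover thresholds to your own targets
PUT _ilm/policy/metricbeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}

If your version does not support forcemerge in the hot phase, the same action can go in the warm phase instead.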

You can see your segments with

GET _cat/segments/metricbeat-*?v

You can clean this up by running the following command. It may take an hour or more to run, as there is only 1 merge thread per node.

POST metricbeat-*/_forcemerge?max_num_segments=1

It is a synchronous command, but you can just run another command and check the results:

GET _cat/segments/metricbeat-*?v

Once the segments are merged, there will be only 1 per shard.
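If you want to see whether the merge is still running on the node (for example if your client times out while waiting), the task management API should list it. This is just a sketch; the wildcard action filter is the part to adjust if it does not match anything on your version:

GET _tasks?actions=*forcemerge*&detailed=true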

But overall... if you are really going to ingest and query 350GB/day or more, I would probably run more than a single node. Here are some suggestions; others may have other ideas.

350GB/day is non-trivial, but we certainly have many use cases with multiple TBs per day; it's about proper scaling.

I would run perhaps 3 nodes, each with a 28GB heap and 1-2TB of SSD.
Index template: 3 primary shards, 1 replica. (Technically this would be better with 6 nodes so every shard copy can be searched in parallel; there is some math behind that.) (If you do not want replicas you can do that, but if you lose a node the index will go red and you will lose the data on that node.)
ILM rollover at 150GB or 1 day: this will produce 3 x 50GB shards; the shards should balance out and you will get some parallelism.
Force merge on hot rollover to 1 segment.
Your indexing seems OK-ish; there are some settings that could make it better, such as the following (a combined template sketch follows below):

"index": {
  "refresh_interval": "30s",
  "translog": {
    "flush_threshold_size": "2gb"
  }
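Pulling the pieces above together, here is a rough sketch of an index template carrying the shard counts and those settings. This assumes a version with composable index templates; the template name, priority, and lifecycle/alias names are placeholders, and since Metricbeat normally ships and manages its own template, you may prefer to adjust that one instead of adding a new one.

# rough sketch: 3 primaries + 1 replica, index managed by the ILM policy above
PUT _index_template/metricbeat-sizing
{
  "index_patterns": ["metricbeat-*"],
  "priority": 200,
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "refresh_interval": "30s",
        "translog": {
          "flush_threshold_size": "2gb"
        },
        "lifecycle": {
          "name": "metricbeat",
          "rollover_alias": "metricbeat"
        }
      }
    }
  }
}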

Another consideration is retention, which you have not mentioned. Say you wanted to keep this for 7 days: 350GB/day + 1 replica = 700GB/day, x 7 days = ~5TB of data.

The other consideration is that you may have a bottleneck with IOPS, though if that is direct-attached SSD it is less of a concern; I am not really familiar with AHCI, though.
