Recommended Elasticsearch node requirements for monitoring 2,000 nodes

Hello, I'm starting to research Metricbeat + Kibana Inventory to monitor system resources.
The total number of machines we are planning to monitor is about 2,000.

Currently I have deployed Metricbeat to 79 hosts and can explore the metrics in Kibana Inventory without any problems.

The next step will be expanding it to more nodes, and I want to provision enough Elasticsearch nodes for it.

So far I've tested on this environment:

  • 5 data nodes (16 GB / 16 cores) + 3 master nodes (16 GB / 16 cores)

metricbeat.yml

logging.level: info

output.elasticsearch:
  ...
  worker: 2
  bulk_max_size: 1024

queue:
  mem:
    events: 4096
    flush.min_events: 2048
max_procs: 1

setup.ilm:
  enabled: auto

setup.dashboards.enabled: false
setup.template.settings:
  index:
    codec: best_compression
    number_of_shards: 5
    number_of_replicas: 1
    refresh_interval: 10s

setup.kibana:
  ...

#------ Metricbeat-specific configuration

metricbeat.max_start_delay: 10s
metricbeat.modules:
  - module: system
    metricsets:
      - cpu             # CPU usage
      - load            # CPU load averages
      - memory          # Memory usage
      - network         # Network IO
      - uptime          # System Uptime
      - fsstat          # File system summary metrics
      - diskio          # Disk IO
      - process_summary # Process summary


    enabled: true
    period: 10s
    processes: ['.*']

    # Configure the metric types that are included by these metricsets.
    cpu.metrics:  ["percentages", "normalized_percentages"]  # The other available options are normalized_percentages and ticks.
    core.metrics: ["percentages"]  # The other available option is ticks.
processors:
- add_host_metadata:
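As a rough sanity check on what this config implies, the sketch below computes a lower bound on document rates: the system module's 8 metricsets each emit at least one event per 10s period, while metricsets like network and diskio emit one event per interface/device, so actual rates run higher. The host count and period are taken from this thread; everything else is straightforward arithmetic.

```python
# Lower-bound document rate implied by the config above.
# Assumption: one event per metricset per period; per-interface/per-device
# metricsets (network, diskio) will produce more in practice.
PERIOD_S = 10    # metricbeat period from the config
METRICSETS = 8   # cpu, load, memory, network, uptime, fsstat, diskio, process_summary
HOSTS = 2000     # planned fleet size from this thread

docs_per_min_per_host = METRICSETS * (60 // PERIOD_S)
fleet_docs_per_sec = HOSTS * METRICSETS / PERIOD_S

print(docs_per_min_per_host)  # 48
print(fleet_docs_per_sec)     # 1600.0
```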

Hi @Bingu_Shim, I've reached out internally regarding your question, and hope to have some information for you soon.


Hi, after speaking with a colleague, unfortunately it's hard for us to answer this as it's often a case of "it depends".

Our recommendation would be to experiment with as much (realistic) data as possible and see if there's a bottleneck. Serving 2,000 nodes in the UI shouldn't be a problem, but your cluster might not be able to handle the write load. If that's the case, you may need to add more shards to optimise for a write-heavy environment.
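If write throughput does turn out to be the bottleneck, one way to reason about primary shard count is to divide the expected ingest rate by a per-shard indexing capacity measured on your own hardware. This is only a sketch: the per-shard capacity below is a placeholder assumption, not a benchmark result, and the ~4,600 TPS total is the estimate that appears later in this thread.

```python
import math

# total_write_tps: the thread's estimate for 2,000 hosts.
# assumed_shard_capacity_tps: HYPOTHETICAL per-primary-shard indexing rate;
# replace with a figure benchmarked on your own cluster (e.g. with Rally).
total_write_tps = 4600
assumed_shard_capacity_tps = 2000

primary_shards = math.ceil(total_write_tps / assumed_shard_capacity_tps)
print(primary_shards)  # 3, with these placeholder numbers
```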

Hello @Kerry

Thank you for your response.
With my configuration above, the numbers of documents written per minute are as follows.

event.dataset            docs per minute
system.diskio            6
system.fsstat            6
system.load              6
system.uptime            6
system.cpu               6
system.memory            6
system.process.summary   6
system.network           96
total                    138

So, I can derive the required write performance as follows.

  • 2.3 TPS per node
  • 4,600 TPS for 2,000 nodes
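The derivation above can be reproduced directly from the per-dataset table:

```python
# Measured docs/minute per dataset, from the table above.
docs_per_minute = {
    "system.diskio": 6,
    "system.fsstat": 6,
    "system.load": 6,
    "system.uptime": 6,
    "system.cpu": 6,
    "system.memory": 6,
    "system.process.summary": 6,
    "system.network": 96,
}

total_per_min = sum(docs_per_minute.values())
tps_per_node = total_per_min / 60
tps_for_fleet = total_per_min * 2000 / 60  # scale to 2,000 hosts

print(total_per_min)           # 138
print(round(tps_per_node, 1))  # 2.3
print(tps_for_fleet)           # 4600.0
```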

Our Elasticsearch cluster should be able to handle this write load. (I've run write performance tests at over 30K TPS in our environment, though the use case was different.)

What I want to know about is latency on the Metrics UI side.
I ran into UI latency problems when using Elastic APM (THIS ISSUE, THIS ISSUE), and found out that the team had just started architectural improvements to solve the latency problem.

So we just want to be sure about the scalability of the Metrics UI before going further.

As you mentioned below, there shouldn't be a scalability problem on the UI side:

Serving 2,000 nodes in the UI shouldn't be a problem

We will try rolling out Metricbeat to more machines.