Metricbeat cluster sizing

Hello, I need your help.
I'm deploying Metricbeat on 3000 servers. The problem is that once about 1000 agents are enrolled, Kibana has problems displaying the metrics; in fact, there are some metrics that Kibana cannot display at all. I think I'm doing something wrong with the cluster config or sizing. I currently have:

  • 3 master nodes, 3 data nodes, 1 Kibana. Each server has 8 vCPU, 16 GB RAM (8 GB JVM heap), 1 TB SSD (HCI)

I only send metrics with Metricbeat + the system module, every 20 seconds. Shards, compression, and ILM are all at their defaults. I only have one index and it rolls over every 50 GB.
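My modules.d/system.yml is essentially the stock one, just with the period changed; roughly:

```yaml
# modules.d/system.yml -- default metricsets, only the period changed
- module: system
  period: 20s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
```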

Thank you

What is the output from the _cluster/stats?pretty&human API?
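That is, from Kibana Dev Tools:

```
GET _cluster/stats?pretty&human
```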

  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "foo",
  "cluster_uuid" : "-zJEjKymQ_CKIrXx9XHj5g",
  "timestamp" : 1624513972272,
  "status" : "red",
  "indices" : {
    "count" : 41,
    "shards" : {
      "total" : 82,
      "primaries" : 41,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 932226156,
      "deleted" : 25433
    },
    "store" : {
      "size" : "782.3gb",
      "size_in_bytes" : 840048702152,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "2.3mb",
      "memory_size_in_bytes" : 2489768,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "120.8mb",
      "memory_size_in_bytes" : 126746955,
      "total_count" : 1090515,
      "hit_count" : 67387,
      "miss_count" : 1023128,
      "cache_size" : 7475,
      "cache_count" : 12456,
      "evictions" : 4981
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1512,
      "memory" : "74.2mb",
      "memory_in_bytes" : 77816184,
      "terms_memory" : "14.4mb",
      "terms_memory_in_bytes" : 15189440,
      "stored_fields_memory" : "1.1mb",
      "stored_fields_memory_in_bytes" : 1203392,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "9kb",
      "norms_memory_in_bytes" : 9216,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "58.5mb",
      "doc_values_memory_in_bytes" : 61414136,
      "index_writer_memory" : "506.5mb",
      "index_writer_memory_in_bytes" : 531149290,
      "version_map_memory" : "2.9mb",
      "version_map_memory_in_bytes" : 3092419,
      "fixed_bit_set" : "181.1mb",
      "fixed_bit_set_memory_in_bytes" : 189947824,
      "max_unsafe_auto_id_timestamp" : 1624511779899,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 12,
          "index_count" : 6
        },
        {
          "name" : "boolean",
          "count" : 422,
          "index_count" : 29
        },
        {
          "name" : "byte",
          "count" : 13,
          "index_count" : 13
        },
        {
          "name" : "constant_keyword",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 548,
          "index_count" : 33
        },
        {
          "name" : "double",
          "count" : 961,
          "index_count" : 13
        },
        {
          "name" : "float",
          "count" : 1279,
          "index_count" : 20
        },
        {
          "name" : "geo_point",
          "count" : 84,
          "index_count" : 13
        },
        {
          "name" : "half_float",
          "count" : 56,
          "index_count" : 14
        },
        {
          "name" : "integer",
          "count" : 154,
          "index_count" : 7
        },
        {
          "name" : "ip",
          "count" : 239,
          "index_count" : 14
        },
        {
          "name" : "keyword",
          "count" : 9320,
          "index_count" : 33
        },
        {
          "name" : "long",
          "count" : 17440,
          "index_count" : 32
        },
        {
          "name" : "nested",
          "count" : 23,
          "index_count" : 9
        },
        {
          "name" : "object",
          "count" : 17469,
          "index_count" : 32
        },
        {
          "name" : "scaled_float",
          "count" : 1185,
          "index_count" : 13
        },
        {
          "name" : "text",
          "count" : 684,
          "index_count" : 26
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.12.1",
        "index_count" : 42,
        "primary_shard_count" : 42,
        "total_primary_size" : "397.3gb",
        "total_primary_bytes" : 426614461213
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 3,
      "data_cold" : 3,
      "data_content" : 3,
      "data_frozen" : 3,
      "data_hot" : 3,
      "data_warm" : 3,
      "ingest" : 6,
      "master" : 3,
      "ml" : 6,
      "remote_cluster_client" : 6,
      "transform" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.1"
    ],
    "os" : {
      "available_processors" : 56,
      "allocated_processors" : 56,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 6
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 6
        }
      ],
      "mem" : {
        "total" : "93.5gb",
        "total_in_bytes" : 100501094400,
        "free" : "10.9gb",
        "free_in_bytes" : 11781750784,
        "used" : "82.6gb",
        "used_in_bytes" : 88719343616,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 90
      },
      "open_file_descriptors" : {
        "min" : 1065,
        "max" : 11840,
        "avg" : 7607
      }
    },
    "jvm" : {
      "max_uptime" : "9.3d",
      "max_uptime_in_millis" : 805845195,
      "versions" : [
        {
          "version" : "16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "16+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 6
        }
      ],
      "mem" : {
        "heap_used" : "21gb",
        "heap_used_in_bytes" : 22609280704,
        "heap_max" : "48gb",
        "heap_max_in_bytes" : 51539607552
      },
      "threads" : 490
    },
    "fs" : {
      "total" : "3.6tb",
      "total_in_bytes" : 4015312076800,
      "free" : "2.8tb",
      "free_in_bytes" : 3172988551168,
      "available" : "2.7tb",
      "available_in_bytes" : 2968880099328
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 6
      },
      "http_types" : {
        "security4" : 6
      }
    },
    "discovery_types" : {
      "zen" : 6
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 6
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 18,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Thanks, there's nothing immediate that jumps out to me there.

Can you elaborate more on what you are seeing here?

Hi warkolm,
For example, when I do a search with an interval of one hour, Kibana takes between 10 and 15 seconds to complete (is that normal?). When I do a search with a range of 10 hours or more, Kibana takes 40-50 seconds and doesn't display some graphs. I tried changing the infrastructure to a single physical node with 256 GB of memory (28 GB heap), 20 cores, and 5 TB SSD, but the behavior is almost the same.

  • Maybe there are too many agents (1000) for the current infrastructure?
  • Maybe I have to extend the period for each metric processed in Metricbeat? Currently it is every 10 seconds.
  • Or do I have to group the agents into different indexes? I have all the agents in the same index/single shard and it rolls over every 50 GB.
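If splitting them up is the way to go, I guess the Metricbeat side would look something like this (untested sketch; `fields.group` is a made-up custom field I'd set per agent, and with ILM enabled the `setup.ilm.*` settings would also need adjusting, since ILM overrides the `index` option):

```yaml
# metricbeat.yml -- hypothetical sketch: route agents to per-group indices
fields:
  group: web          # set differently per agent, e.g. web / db / app

output.elasticsearch:
  hosts: ["es-host:9200"]
  index: "metricbeat-%{[fields.group]}-%{[agent.version]}"

# a custom index name also requires overriding the template settings
setup.template.name: "metricbeat"
setup.template.pattern: "metricbeat-*"
```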

Kibana dashboard: (screenshot)

Metricbeat config: (screenshot)

Index: (screenshot)

Index health: (screenshot)

Node health: (screenshot)

Thank you

Hi @premierpsp

A couple of questions...

How many hours is that 47.8 GB of data?

Curious, is that one of the default dashboards?

Also, if you go to Discover and just load 24 hours of metricbeat-* data, how long does it take?

How big is your Kibana instance?

Hi Stephenb,

3.5 hours

Yep, with some changes

It took 66 seconds

For now everything is on the same server (master, data, Kibana). It has 256 GB RAM (28 GB for heap), 20 cores, 5 TB SSD.

I'll also leave you some values. In the Search Latency graph, you can see the high peak when I made the query that took 66 seconds:

Search Latency spike in a normal query (0-1 hour): (screenshot)

Indexes: (screenshot)

Stats:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "foo",
  "cluster_uuid" : "3JKetp79RCGU8e_b-NPebQ",
  "timestamp" : 1625941385746,
  "status" : "green",
  "indices" : {
    "count" : 22,
    "shards" : {
      "total" : 22,
      "primaries" : 22,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 901073763,
      "deleted" : 28242
    },
    "store" : {
      "size" : "350gb",
      "size_in_bytes" : 375883101668,
      "total_data_set_size" : "350gb",
      "total_data_set_size_in_bytes" : 375883101668,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "276.7kb",
      "memory_size_in_bytes" : 283360,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "175mb",
      "memory_size_in_bytes" : 183588982,
      "total_count" : 3993454,
      "hit_count" : 993380,
      "miss_count" : 3000074,
      "cache_size" : 4494,
      "cache_count" : 6386,
      "evictions" : 1892
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 366,
      "memory" : "16.5mb",
      "memory_in_bytes" : 17358472,
      "terms_memory" : "3.7mb",
      "terms_memory_in_bytes" : 3949472,
      "stored_fields_memory" : "397.2kb",
      "stored_fields_memory_in_bytes" : 406752,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.5kb",
      "norms_memory_in_bytes" : 3584,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "12.3mb",
      "doc_values_memory_in_bytes" : 12998664,
      "index_writer_memory" : "118.9mb",
      "index_writer_memory_in_bytes" : 124698936,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "80.7mb",
      "fixed_bit_set_memory_in_bytes" : 84640504,
      "max_unsafe_auto_id_timestamp" : 1625932807048,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 24,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 488,
          "index_count" : 12,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 8,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 465,
          "index_count" : 13,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 1330,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 1712,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 56,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 15,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 66,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 168,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 7595,
          "index_count" : 13,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 19429,
          "index_count" : 13,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 10,
          "index_count" : 4,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 19555,
          "index_count" : 13,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 1095,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 448,
          "index_count" : 13,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.13.3",
        "index_count" : 22,
        "primary_shard_count" : 22,
        "total_primary_size" : "350gb",
        "total_primary_bytes" : 375883101668
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 1,
      "coordinating_only" : 0,
      "data" : 1,
      "data_cold" : 1,
      "data_content" : 1,
      "data_frozen" : 1,
      "data_hot" : 1,
      "data_warm" : 1,
      "ingest" : 1,
      "master" : 1,
      "ml" : 1,
      "remote_cluster_client" : 1,
      "transform" : 1,
      "voting_only" : 0
    },
    "versions" : [
      "7.13.3"
    ],
    "os" : {
      "available_processors" : 40,
      "allocated_processors" : 40,
      "names" : [
        {
          "name" : "Linux",
          "count" : 1
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Oracle Linux Server 7.9",
          "count" : 1
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 1
        }
      ],
      "mem" : {
        "total" : "251.4gb",
        "total_in_bytes" : 269966901248,
        "free" : "912.7mb",
        "free_in_bytes" : 957136896,
        "used" : "250.5gb",
        "used_in_bytes" : 269009764352,
        "free_percent" : 0,
        "used_percent" : 100
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 14
      },
      "open_file_descriptors" : {
        "min" : 1739,
        "max" : 1739,
        "avg" : 1739
      }
    },
    "jvm" : {
      "max_uptime" : "23.7h",
      "max_uptime_in_millis" : 85391877,
      "versions" : [
        {
          "version" : "16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "16+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "18.1gb",
        "heap_used_in_bytes" : 19469324264,
        "heap_max" : "28gb",
        "heap_max_in_bytes" : 30064771072
      },
      "threads" : 221
    },
    "fs" : {
      "total" : "4.8tb",
      "total_in_bytes" : 5366567927808,
      "free" : "4.5tb",
      "free_in_bytes" : 4988930772992,
      "available" : "4.5tb",
      "available_in_bytes" : 4988930772992
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 1
      },
      "http_types" : {
        "security4" : 1
      }
    },
    "discovery_types" : {
      "zen" : 1
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 1
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 1,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

On that Discover view, can you load it once or twice, click on the Inspect button, and show what the query time and the round-trip time are?

Just gathering some data...

Then I might have some suggestions.

Here you go. 24 hour query:


Response time:

{
  "took": 53365,
  "timed_out": false,
  "_shards": {
    "total": 8,
    "successful": 8,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 884992923,
    "max_score": null,

1 hour query:


Response time:

{
  "took": 9692,
  "timed_out": false,
  "_shards": {
    "total": 8,
    "successful": 8,
    "skipped": 6,
    "failed": 0
  },
  "hits": {
    "total": 36796963,
    "max_score": null,

Thanks

So one of the first things that jumps out at me is that the amount of data seems high...
50 GB per 3.5 hours = ~350 GB/day for 1000 hosts.
That's about 350 MB per host per day, which seems high, but I need to validate.
I am running a test to see what I get with the system metricset defaults plus diskio (I can't run metrics without diskio :slight_smile: ), and initial estimates are about 1/3 of that: 100-120 MB/host/day.

And from your 1-hour Discover query...

That also seems high: when I take the number of docs in an hour (36,796,963) and divide by 1000 hosts, that is still ~36K docs per hour per host. That is about 3x what I would expect; I see about 10K per hour per host with the default settings (10s, and 1m for filesystem, etc.).
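Spelling out that back-of-envelope math (the inputs are the 47.8 GB / 3.5 h figure and the 1-hour doc count quoted in this thread):

```python
# Back-of-envelope check of the ingest-rate estimates above.
gb_per_rollover = 47.8       # observed size of one rollover index (GB)
hours_per_rollover = 3.5     # time it took to fill it
hosts = 1000

gb_per_day = gb_per_rollover / hours_per_rollover * 24            # ~328 GB/day
mb_per_host_per_day = gb_per_day * 1024 / hosts                   # ~336 MB, in line with the ~350 MB estimate

docs_per_hour = 36_796_963   # hits.total from the 1-hour Discover query
docs_per_host_per_hour = docs_per_hour / hosts                    # ~36.8K vs ~10K expected at defaults

print(f"{gb_per_day:.0f} GB/day, {mb_per_host_per_day:.0f} MB/host/day, "
      f"{docs_per_host_per_hour:.0f} docs/host/hour")
```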

All that said, a properly scaled, architected, and configured cluster should be able to ingest and query the data volumes you are working with. But before we get to that, let's make sure the ingest volume is correct.

One part of your post says 10s, the other says 20s, and I don't know whether you made any other changes, so I am not really clear on what you have configured on the Metricbeat side.

You could run this query and share the results. You would want to run it when the last hour is complete; if not, you will need to change the range filter to something like this once you have a complete set:

          "range": {
            "@timestamp": {
              "gte": "2021-07-11T02:40:02.000Z",
              "lt":  "2021-07-11T03:40:02.000Z"
            }
          }

Here is the query; please share the results.

GET metricbeat-7.12.1-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h/s",
              "lt": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "doc_count": {
      "value_count": {
        "field": "_id"
      }
    },
    "host_count": {
      "cardinality": {
        "field": "host.name"
      }
    },
    "docs_per_metricset_count": {
      "terms": {
        "field": "metricset.name"
      }
    },
    "min_date": {
      "min": {
        "field": "@timestamp",
        "format": "strict_date_optional_time"
      }
    },
    "max_date": {
      "max": {
        "field": "@timestamp",
        "format": "strict_date_optional_time"
      }
    }
  }
}

This is what I got:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029045273/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      },
      {
        "type" : "exception",
        "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000003",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029047894,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000004",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029047894,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000005",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029047894,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000006",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029044606,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000007",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029044606,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000008",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029044606/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029044606,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.10-000009",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029045273/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029045273/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029045273/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029045273,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      },
      {
        "shard" : 0,
        "index" : "metricbeat-7.12.1-2021.07.11-000010",
        "node" : "I92XZ9GFThaN5pZkPB9N6Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
          "caused_by" : {
            "type" : "execution_exception",
            "reason" : "CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]]",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[fielddata] Data too large, data for [_id] would be [12029047894/11.2gb], which is larger than the limit of [12025908428/11.1gb]",
              "bytes_wanted" : 12029047894,
              "bytes_limit" : 12025908428,
              "durability" : "PERMANENT"
            }
          }
        }
      }
    ]
  },
  "status" : 500
}

That's interesting... You are running on 1 node now, right?
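As an aside, those circuit breaker errors can be inspected directly. The per-node breaker state (including the fielddata breaker that tripped above) is visible with the nodes stats API, and the fielddata cache can be cleared for temporary relief — this is just a diagnostic sketch, not a fix for the underlying sizing:

```
GET _nodes/stats/breaker

POST metricbeat-*/_cache/clear?fielddata=true
```

The cleared fielddata will build back up as soon as something aggregates or sorts on that field again, so it only buys time.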

Try this simpler version


GET metricbeat-7.12.1-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h/s",
              "lt": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "host_count": {
      "cardinality": {
        "field": "host.name"
      }
    },
    "docs_per_metricset_count": {
      "terms": {
        "field": "metricset.name",
        "size": 20
      }
    }
  }
}

Yes. Now it works:

{
  "took" : 2680,
  "timed_out" : false,
  "_shards" : {
    "total" : 7,
    "successful" : 7,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "docs_per_metricset_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "diskio",
          "doc_count" : 24676868
        },
        {
          "key" : "core",
          "doc_count" : 5215241
        },
        {
          "key" : "process",
          "doc_count" : 2592610
        },
        {
          "key" : "network",
          "doc_count" : 2288555
        },
        {
          "key" : "cpu",
          "doc_count" : 658714
        },
        {
          "key" : "load",
          "doc_count" : 329356
        },
        {
          "key" : "memory",
          "doc_count" : 329356
        },
        {
          "key" : "process_summary",
          "doc_count" : 329354
        },
        {
          "key" : "filesystem",
          "doc_count" : 111863
        },
        {
          "key" : "fsstat",
          "doc_count" : 10985
        },
        {
          "key" : "uptime",
          "doc_count" : 3662
        },
        {
          "key" : "status",
          "doc_count" : 360
        }
      ]
    },
    "host_count" : {
      "value" : 916
    }
  }
}

Interesting... Your diskio document counts are very high; there must be many, many disks on each host. They are about 3-5x what I would expect.

But the other numbers seem OK, I guess, though it still seems high for 1000 Metricbeat hosts.

So here are some of the things I would do, others may have other opinions.

You can try this on your 1 node.

You already have a lot of segments (the underlying Lucene data structures). You are not force merging when you roll over, so you are generating segments that are beginning to add up. Lots of segments = slow queries; you already have 322 segments when you only have 22 shards.

In the ILM policy, on rollover, set force merge to 1 segment.
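As a sketch (the policy name and thresholds here are illustrative — adjust to your setup), a hot-phase rollover with a force merge to 1 segment looks roughly like this; note the forcemerge action in the hot phase requires rollover to be present:

```
PUT _ilm/policy/metricbeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "150gb",
            "max_age": "1d"
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```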

You can see your segments with

GET _cat/segments/metricbeat-*/?v

You can clean this up by running the following command. It may take 1 or more hours to run, as there is only 1 merge thread per node.

POST metricbeat-*/_forcemerge/?max_num_segments=1

It is a synchronous command, but you can just run another command from a second console and check the results:
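If you want to watch the merge while it runs, the task management API can show it (just a monitoring aside; the action filter below should match force merge tasks):

```
GET _tasks?actions=*forcemerge*&detailed=true
```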

GET _cat/segments/metricbeat-*/?v

Once the segments are merged there will be only 1 per shard.

But overall... if you are really going to ingest and query 350GB/day or more, I would run more than a single node. Here are some suggestions; others may have other suggestions.

350GB/day is non-trivial, but we certainly have many use cases with multiple TBs per day; it's about proper scaling.

I would run perhaps 3 nodes, each with 28GB heap and 1-2TB SSD.
Index template: 3 primary shards, 1 replica. (Technically this would be better with 6 nodes so each shard can be completely parallel; there is some math to it. If you do not want replicas you can do that, but if you lose a node you will lose data.)
ILM rollover at 150GB or 1 day: this will make 3 x 50GB shards; the shards should balance out and you will get some parallelism.
Force merge on hot rollover to 1 segment.
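A sketch of the matching index template (the template name and order are illustrative; Metricbeat 7.12 loads its own `metricbeat-7.12.1` template, so an override with a higher order is one way to layer these settings on top):

```
PUT _template/metricbeat-custom
{
  "index_patterns": ["metricbeat-7.12.1-*"],
  "order": 1,
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "lifecycle": {
        "name": "metricbeat",
        "rollover_alias": "metricbeat-7.12.1"
      }
    }
  }
}
```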
Your indexing seems OK-ish. There are some settings that could make it better, like:

"index": {
  "refresh_interval": "30s",
  "translog": {
    "flush_threshold_size": "2gb"
  }

Another consideration is retention, which you have not mentioned. Say you wanted to keep this for 7 days: 350GB/day + 1 replica = 700GB/day * 7 days = ~5TB of data.

Another consideration is that you may have some bottleneck with IOPS; that matters less if it is direct-attached SSD, but I am not really familiar with HCI.


I will follow your advice and post the results this Monday. By the way, I have a question:

Replicas = more latency, right?

Do you mean that the index will roll over every 150GB? Because the current value is every 50GB.

I have "migrated" to SAN-mapped disks, due to hyperconvergence being saturated. So I have no IOPS performance issues :slight_smile:

Replicas can add some latency to indexing, but they can also add some query performance, so it's plus/minus as you say.

What I'm saying is: if in your index template you put 3 primary shards and then tell ILM to roll the index over at 150GB, then each of the 3 shards will be ~50GB. 3 shards x 50GB = 150GB index.

That is one strategy... A single index with 1 shard rolled over at 50GB can also work; as you add more nodes you'll get some parallelism.

I will have to take your word on the disk; in general we would not recommend any SAN or network-attached storage.

That is because one symptom of slow disk access is slow query times... it seems like you're writing fast enough.


I really appreciate your time and help. Thanks :+1:
