Stack Monitoring - Elasticsearch

Hello World!

Via Kibana -> Stack Monitoring

Any ideas what's causing so many gaps in between, i.e. why the line isn't solid?

Thanks in advance!

Are you using the legacy self-monitoring or metricbeat to collect the metrics?
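
If you're not sure, a quick way to check (USER:PASS and https://ES:9200 are placeholders for your credentials and endpoint) is to look at the cluster setting that drives legacy collection, and at the monitoring index names; indices written by Metricbeat typically contain "-mb-" in the name:

# Is legacy (internal) collection switched on?
curl -s -u USER:PASS "https://ES:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep monitoring.collection.enabled

# Monitoring indices: names containing "-mb-" typically come from Metricbeat
curl -s -u USER:PASS "https://ES:9200/_cat/indices/.monitoring-*?v&h=index,docs.count,store.size"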

This normally happens when the service (elasticsearch, logstash etc) is under heavy load and doesn't answer the metric requests.

Zoom out to at least 30 minutes. What versions?

Thank you for looking into my topic!

  • Regarding the metrics: I do believe the cluster is still using legacy self-monitoring at the moment.
  • Regarding the load: I don't think the Elasticsearch cluster is under heavy load.

How do I check the cluster's load, anyway?

An interesting observation: if I zoom out to 30 minutes instead of the default 15 minutes, there is no break in the lines at all, yet when I zoom back in, there are a lot of blank spaces in the line. It's almost as if it's not rendering properly at 15 minutes for whatever reason, but the data is there...

The version is 7.17.5.

Usually when I see this it means the ES master is having trouble keeping up. Doubly so when using internal-monitoring since it needs to do all the work of gathering and shipping data to the monitoring cluster.

Things I'd watch for:

  • GC warnings in the ES logs
  • long running _cat/tasks
  • master node hot threads & circuit breakers
  • Large number of fields relative to master memory

That "many fields" has been a hard one for me since it sometimes won't manifest via any log.

It's just that when fields get added, the master needs to do that work in addition to everything else, so everything gets just that little bit slower (monitoring reporting, indexing new docs, etc.).
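
Rough commands for checking the items above (USER:PASS and https://ES:9200 are placeholders; adjust for your setup). The first one also answers the earlier question about how to check the cluster's load:

# Per-node CPU / load / heap; '*' in the master column marks the elected master
curl -s -u USER:PASS "https://ES:9200/_cat/nodes?v&h=name,node.role,master,cpu,load_1m,heap.percent"

# Circuit breaker trip counts per node (non-zero 'tripped' is a red flag)
curl -s -u USER:PASS "https://ES:9200/_nodes/stats/breaker" | jq '.nodes[] | {name: .name, tripped: (.breakers | map_values(.tripped))}'

# Hot threads on the elected master only
curl -s -u USER:PASS "https://ES:9200/_nodes/_master/hot_threads"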

Hope that helps!

  • GC warnings in the elasticsearch log:

{"type": "server", "timestamp": "2022-08-26T15:25:56,579Z", "level": "WARN", "component": "o.e.i.SystemIndexManager", "cluster.name": "elastic", "node.name": "elastic-es-master-2", "message": "Missing _meta field in mapping [doc] of index [.watches], assuming mappings update required", "cluster.uuid": "90tky-UNTRCfA42cAFK-aw", "node.id": "HqIuhom5RDOxyN7I8_rfwg" }

  • long running _cat/tasks
data_frame/transforms[c]              sXu62qwyQvK-sty1vjXWIQ:257560   cluster:117                     persistent 1661283558215 19:39:18 2.9d        10.202.77.2   elastic-es-data-9
geoip-downloader[c]                   snl_FAzBQcaQ43HEDNBCEA:308961   cluster:121                     persistent 1661284170370 19:49:30 2.9d        10.202.78.130 elastic-es-data-8
data_frame/transforms[c]              snl_FAzBQcaQ43HEDNBCEA:674625   cluster:123                     persistent 1661286361074 20:26:01 2.8d        10.202.78.130 elastic-es-data-8
  • master node hot threads & circuit breakers

I'm not sure how to check for that; if you could advise, I would appreciate it :wink:
I did check /_cat/thread_pool and most of what I see has zeros in all three columns.

  • Large number of fields relative to master memory

I'm also not sure how to check for that.

I did check indexing via the Kibana monitoring app and it seems low (the highest index is only ~50/s).

Please advise, and thank you in advance!

That warning might be fine, though hopefully it's only happening once.

GC warnings will look more like this:

[gc][359] overhead, spent [7.5s] collecting in the last [8.3s]

Also those tasks are expected to be long running. What you'll want to watch for are long running write or search tasks.
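
For example (placeholders again), something like this lists the currently running search and write tasks and how long they've been going:

# Running search/write tasks with action, running time, and (when detailed) a description
curl -s -u USER:PASS "https://ES:9200/_tasks?detailed=true&actions=*search*,indices:data/write*" | jq '[.nodes[].tasks[] | {action, running_time_in_nanos, description}]'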

The hot threads API doc is at Nodes hot threads API | Elasticsearch Guide [8.4] | Elastic. Good that the thread pool is low; rejections are the thing to watch for there. See cat thread pool API | Elasticsearch Guide [8.4] | Elastic for info.
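
A quick way to surface rejections, for example:

# Rejections per thread pool, worst first
curl -s -u USER:PASS "https://ES:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc"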

For field count I usually check the field caps API. Field capabilities API | Elasticsearch Guide [8.4] | Elastic

curl -s -u USER:PASS https://ES/_field_caps'?fields=*' | jq '.fields | keys | length'

As that number grows, the master can slow down, especially if a lot of fields are getting added frequently. One time I saw this happen because HTTP request header names were being added as fields.

I've never seen a very specific measurement or guidance, but generally if I see more than 10k fields per 1 GB of master heap, I start looking for ways to reduce the field count.
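
If the total looks high, it can help to see which index patterns contribute the most fields. A rough sketch (the patterns below are just examples; substitute your own):

# Field count per index pattern, to see which mappings dominate
for p in 'filebeat-*' 'metricbeat-*' 'auditbeat-*' 'logstash-*' '.monitoring-*'; do
  n=$(curl -s -u USER:PASS "https://ES:9200/$p/_field_caps?fields=*" | jq '.fields | keys | length')
  echo "$p: $n"
done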

If you're able, try giving the master node more heap for starters.

If that helps make the graphs consistent then you know where to focus attention (alleviating master pressure).

Regarding "GC warnings will look more like this":

I checked the master's Elasticsearch log and did not find warnings like the one you listed; the only warning I did find is the one I posted previously, and from what I researched it's safe to ignore.

cluster-nodes-hot-threads:

   100.3% [cpu=100.3%, other=0.0%] (501.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic-es-data-11][[.monitoring-es-7-2022.08.29][0]: Lucene Merge Thread #2343]'

&

   100.4% [cpu=100.4%, other=0.0%] (501.8ms out of 500ms) cpu usage by thread 'elasticsearch[elastic-es-data-1][system_read][T#2]'

&

   100.0% [cpu=27.6%, other=72.4%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[elastic-es-coor-0][transport_worker][T#4]'

&

   100.3% [cpu=100.3%, other=0.0%] (501.6ms out of 500ms) cpu usage by thread 'elasticsearch[elastic-es-data-9][write][T#9]'

thread_pool

low (almost all 0s)

field counts:

% curl -s -k -u elastic:$es_password https://localhost:9200/_field_caps'?fields=*' | jq '.fields | keys | length'
12688
%

After deleting the auditbeat-*, filebeat-*, metricbeat-*, and logstash-* indices, it's now:

% curl -s -k -u elastic:$es_password https://localhost:9200/_field_caps'?fields=*' | jq '.fields | keys | length' 
4806
% 

heap

The master node has a heap of 14 GB and is using under 9 GB.

100% CPU in merge for .monitoring-es-7-2022.08.29 is curious. I wonder if the nodes handling that index are struggling rather than the master. It might help to try adding shards to that index.

The quickest way I know to do that is to update the template and delete the index, then when new documents show up it'll have more shards.

Note that it's a bit destructive, so you may want to make sure you have backups.

I don't have an internal-collection cluster handy right now to provide exact steps, but hopefully the above can help you find the underlying issue.
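
For what it's worth, you can at least inspect what's currently there before changing anything (the .monitoring* names below are an assumption; verify what actually exists in your cluster first):

# Which legacy templates cover the monitoring indices
curl -s -u USER:PASS "https://ES:9200/_cat/templates/.monitoring*?v&h=name,index_patterns,order"

# Current shard/replica settings on the monitoring indices
curl -s -u USER:PASS "https://ES:9200/.monitoring-es-7-*/_settings?filter_path=*.settings.index.number_of_shards,*.settings.index.number_of_replicas"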

There are several data nodes, and each data node is running on GCP n1-standard-16 (16 vCPUs, 60 GB RAM); heap size is 31 GB on each of the data nodes.

The busiest node uses about 4 vCPUs (out of 16) during peaks.

I checked the previous day's .monitoring-es-7-YYYY.MM.DD index: it is about 10 GB in size, has about 10M docs, and has 1 primary and 1 replica shard.
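
For reference, the per-index size, doc count, and shard counts can be listed with something like:

curl -s -k -u elastic:$es_password "https://localhost:9200/_cat/indices/.monitoring-es-7-*?v&h=index,pri,rep,docs.count,store.size&s=index"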

Yeah, that all sounds healthy. If you only see it at 15min zoom, I guess it's possible we have some rendering bug.

If you're able, have a look after updating to a more recent version, and if it persists, we can move this to a GitHub issue for further investigation.

I had 7.17 booted today so I enabled internal collection and 15 min interval seems okay so far.
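
(For reference, enabling internal collection is just the dynamic cluster setting, something like the following with your own endpoint and credentials:)

curl -s -u USER:PASS -X PUT "https://ES:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": {"xpack.monitoring.collection.enabled": true}}'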

I doubt this is a bug; more likely some misconfiguration within the cluster (it's like looking for a needle in a haystack :upside_down_face:).

% curl --silent --insecure https://elastic:$es_password@localhost:9200/_nodes/_all/jvm | jq | grep using_compressed_ordinary_object_pointers | uniq
        "using_compressed_ordinary_object_pointers": "true",
% 
root@elastic-es-data-1:/usr/share/elasticsearch# sysctl vm.max_map_count
vm.max_map_count = 65530
root@elastic-es-data-1:/usr/share/elasticsearch# 

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.