Exporting Elastic cloud cluster performance stats

I would like to make use of the metrics available from Cloud / Deployments / [deployment] / Performance. The metrics are interesting, but the charts are difficult to use. I use Datadog for monitoring, and I want to be able to see my Elasticsearch metrics in context with the other systems that drive the load.

How can I get these metrics out of the Performance monitoring page in an automated way? Are they available via an API call or some other method? I am already exporting production cluster metrics to a monitoring cluster, but I don't see the per-node Number of Requests, Search Response Times, and Memory Pressure numbers.

Hi @kielni, welcome to the community, and thanks for trying Elastic Cloud.

If you have set up shipping metrics to a monitoring cluster like this, then I think all of the metrics you are looking for are there on the Advanced tabs of the nodes in Stack Monitoring; there are quite a few detailed metrics for each node...

Depending on which version you are on, the data will be stored in system indices (whose names start with a .) in the cluster that you are shipping the metrics to.

You will need to set up something (a job) to pull those metrics into Datadog. How you want to do that I am not sure; there are lots of ways, since the data is available via all the standard Elasticsearch REST APIs... Of course, perhaps someday you could push those other systems' metrics into Elasticsearch and see them all together :slight_smile:
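As one possible shape for such a job, here is a minimal sketch of building a single gauge data point in the format of Datadog's v1 series submission API. The metric name and tag are made up for illustration; the payload structure follows Datadog's documented v1 format.

```python
import time


def build_datadog_payload(metric, value, tags=None, ts=None):
    """Build a Datadog v1 /api/v1/series payload for a single gauge point."""
    return {
        "series": [
            {
                "metric": metric,
                "type": "gauge",
                "points": [[int(ts or time.time()), value]],
                "tags": tags or [],
            }
        ]
    }


# Example: one heap-usage point for a node (metric name is hypothetical)
payload = build_datadog_payload(
    "elasticsearch.node.heap_used_percent",
    75,
    tags=["node:instance-0000000075"],
    ts=1614534937,
)
# The payload would then be POSTed to https://api.datadoghq.com/api/v1/series
# with a DD-API-KEY header, e.g. requests.post(url, json=payload, headers=...)
```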

If you are on a fairly recent version, then you can run something like

GET _cat/indices/.monitoring-es-*/?v&s=index:desc

And here is a sample doc; there is a lot of info in these indices.

GET .monitoring-es-7-mb-2021.02.28/_search
{
  "query": {
    "term": {
      "type" : {
        "value": "node_stats"
      }
    }
  }
}
{
  "_index": ".monitoring-es-7-mb-2021.02.28",
  "_type": "_doc",
  "_id": "g27I6XcBMFH355-yEXuN",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2021-02-28T17:55:37.728Z",
    "event": {
      "duration": 69043664,
      "dataset": "elasticsearch.node.stats",
      "module": "elasticsearch"
    },
    "type": "node_stats",
    "node_stats": {
      "fs": {
        "total": {
          "total_in_bytes": 257698037760,
          "free_in_bytes": 92659236864,
          "available_in_bytes": 92659236864
        },
        "io_stats": {}
      },
      "jvm": {
        "gc": {
          "collectors": {
            "young": {
              "collection_time_in_millis": 2866340,
              "collection_count": 94667
            },
            "old": {
              "collection_count": 94667,
              "collection_time_in_millis": 2866340
            }
          }
        },
        "mem": {
          "heap_max_in_bytes": 4106223616,
          "heap_used_in_bytes": 3119076864,
          "heap_used_percent": 75
        }
      },
      "mlockall": false,
      "os": {
        "cpu": {
          "load_average": {
            "15m": 1.16,
            "1m": 1.14,
            "5m": 1.36
          }
        },
        "cgroup": {
          "cpu": {
            "control_group": "/",
            "cfs_period_micros": 100000,
            "cfs_quota_micros": 266666,
            "stat": {
              "number_of_times_throttled": 703402,
              "time_throttled_nanos": 117860160080464,
              "number_of_elapsed_periods": 8697138
            }
          },
          "memory": {
            "limit_in_bytes": "8589934592",
            "usage_in_bytes": "8226234368",
            "control_group": "/"
          },
          "cpuacct": {
            "control_group": "/",
            "usage_nanos": 731799111098956
          }
        }
      },
      "thread_pool": {
        "get": {
          "queue": 0,
          "rejected": 0,
          "threads": 0
        },
        "management": {
          "threads": 5,
          "queue": 0,
          "rejected": 0
        },
        "search": {
          "rejected": 0,
          "threads": 4,
          "queue": 0
        },
        "watcher": {
          "rejected": 0,
          "threads": 0,
          "queue": 0
        },
        "write": {
          "threads": 2,
          "queue": 0,
          "rejected": 0
        },
        "generic": {
          "threads": 93,
          "queue": 0,
          "rejected": 0
        }
      },
      "node_master": true,
      "node_id": "QtKLeIcuTWuHdo8uLOcLUw",
      "indices": {
        "store": {
          "size_in_bytes": 163324922738
        },
        "indexing": {
          "throttle_time_in_millis": 0,
          "index_total": 391884858,
          "index_time_in_millis": 39084616
        },
        "search": {
          "query_total": 5653491,
          "query_time_in_millis": 7069959
        },
        "query_cache": {
          "memory_size_in_bytes": 40576810,
          "hit_count": 858520,
          "miss_count": 2970609,
          "evictions": 84405
        },
        "docs": {
          "count": 524023513
        },
        "fielddata": {
          "memory_size_in_bytes": 159200,
          "evictions": 0
        },
        "segments": {
          "fixed_bit_set_memory_in_bytes": 15284968,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 197824,
          "doc_values_memory_in_bytes": 25912280,
          "count": 2584,
          "stored_fields_memory_in_bytes": 1484320,
          "version_map_memory_in_bytes": 985056,
          "memory_in_bytes": 44418528,
          "points_memory_in_bytes": 0,
          "terms_memory_in_bytes": 16824104,
          "index_writer_memory_in_bytes": 21738132
        },
        "request_cache": {
          "memory_size_in_bytes": 40068049,
          "evictions": 18668,
          "hit_count": 1008716,
          "miss_count": 75787
        }
      },
      "process": {
        "max_file_descriptors": 1048576,
        "cpu": {
          "percent": 6
        },
        "open_file_descriptors": 3050
      }
    },
    "metricset": {
      "name": "node_stats",
      "period": 10000
    },
    "interval_ms": 10000,
    "timestamp": "2021-02-28T17:55:37.797Z",
    "source_node": {
      "name": "instance-0000000075",
      "transport_address": "10.44.255.143:19936",
      "uuid": "QtKLeIcuTWuHdo8uLOcLUw"
    },
    "service": {
      "address": "7d08aea6f01c:18565",
      "type": "elasticsearch"
    },
    "cluster_uuid": "lPhIKHfzSGO52N-k2eXlBQ",
    "ecs": {
      "version": "1.5.0"
    },
    "host": {
      "name": "7d08aea6f01c"
    },
    "agent": {
      "type": "metricbeat",
      "version": "7.9.2",
      "hostname": "7d08aea6f01c",
      "ephemeral_id": "91c6f677-023f-4f17-9462-03d66aa55a9a",
      "id": "a109ab63-6433-4c13-8ced-7ed77da5082c",
      "name": "7d08aea6f01c"
    }
  },
  "fields": {
    "timestamp": [
      "2021-02-28T17:55:37.797Z"
    ]
  },
  "sort": [
    1614534937797
  ]
}
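To give an idea of how these docs can be consumed downstream, here is a small illustration (my own sketch, not an official client) that pulls a few of the fields above out of a node_stats doc's _source:

```python
def extract_node_metrics(source):
    """Pull a few interesting metrics out of a node_stats monitoring doc's _source."""
    stats = source["node_stats"]
    return {
        "node": source["source_node"]["name"],
        "heap_used_percent": stats["jvm"]["mem"]["heap_used_percent"],
        "index_total": stats["indices"]["indexing"]["index_total"],
        "query_total": stats["indices"]["search"]["query_total"],
    }


# Trimmed-down _source mirroring the sample doc above
sample = {
    "source_node": {"name": "instance-0000000075"},
    "node_stats": {
        "jvm": {"mem": {"heap_used_percent": 75}},
        "indices": {
            "indexing": {"index_total": 391884858},
            "search": {"query_total": 5653491},
        },
    },
}
print(extract_node_metrics(sample))
```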

Hope this helps ..

Yes, that is helpful, thanks. Yes, it would be nice if Elasticsearch cooperated better with other services; it's one of the most frustrating things about using it.

I have this query to get the stats:

GET .monitoring-es*/_search
{
  "aggs": {
    "node": {
      "terms": {
        "field": "source_node.name",
        "size": 5
      },
      "aggs": {
        "from_ts": {
          "min": {
            "field": "timestamp"
          }
        },
        "to_ts": {
          "max": {
            "field": "timestamp"
          }
        },
        "from_index_count": {
          "min": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "to_index_count": {
          "max": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "from_index_time": {
          "min": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "to_index_time": {
          "max": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "from_search_count": {
          "min": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "to_search_count": {
          "max": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "from_search_time": {
          "min": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "to_search_time": {
          "max": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "heap_used_percent": {
          "avg": {
            "field": "node_stats.jvm.mem.heap_used_percent"
          }
        }
      }
    }
  },
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "node_stats"
          }
        },
        {
          "term": {
            "cluster_uuid": "abc123"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  }
}

That returns data like this (one per node, removed the others for brevity):

        {
          "key" : "instance-0000000010",
          "doc_count" : 29,
          "to_index_time" : {
            "value" : 619690.0
          },
          "from_search_count" : {
            "value" : 1863698.0
          },
          "heap_used_percent" : {
            "value" : 67.10344827586206
          },
          "from_index_time" : {
            "value" : 619525.0
          },
          "to_ts" : {
            "value" : 1.614613849335E12,
            "value_as_string" : "2021-03-01T15:50:49.335Z"
          },
          "to_search_time" : {
            "value" : 316045.0
          },
          "to_search_count" : {
            "value" : 1865266.0
          },
          "to_index_count" : {
            "value" : 509566.0
          },
          "from_ts" : {
            "value" : 1.614613569326E12,
            "value_as_string" : "2021-03-01T15:46:09.326Z"
          },
          "from_index_count" : {
            "value" : 509433.0
          },
          "from_search_time" : {
            "value" : 315787.0
          }
        }

Then I calculate the metrics like this:

print("node\ttype\treq\tms/req")
# "data" is the parsed JSON response from the aggregation query above
for bucket in data["aggregations"]["node"]["buckets"]:
    node = bucket["key"]
    if "tiebreaker" in node:  # skip the voting-only tiebreaker node
        continue
    sec = (bucket["to_ts"]["value"] - bucket["from_ts"]["value"]) / 1000  # window length in seconds
    req = bucket["to_search_count"]["value"] - bucket["from_search_count"]["value"]
    time = bucket["to_search_time"]["value"] - bucket["from_search_time"]["value"]
    print(f"{node}\tsearch\t{req}\t{round(time / req) if req else 0}")
    req = bucket["to_index_count"]["value"] - bucket["from_index_count"]["value"]
    time = bucket["to_index_time"]["value"] - bucket["from_index_time"]["value"]
    print(f"{node}\tindex\t{req}\t{round(time / req) if req else 0}")

which calculates

node	type	req	ms/req
instance-0000000010	search	1586.0	0
instance-0000000010	index	2792.0	1
instance-0000000011	search	1588.0	0
instance-0000000011	index	2806.0	1

but these don't seem to match the numbers on the deployments / Performance page:

[Screenshot: Performance page charts, 2021-03-01 8:26 AM]
Number of requests shows search = 487, index = 147
Search response time (ms) avg = 5
Index response time (ms) avg = 363

There are no units on these charts, so I can't tell if these are per second; the tooltips seem to indicate 5 minute buckets, and I used a 5 minute window in my query. Also, the indexing number (from max(node_stats.indices.indexing.index_total) - min(node_stats.indices.indexing.index_total)) is almost double the search number (from max(node_stats.indices.search.query_total) - min(node_stats.indices.search.query_total)), but the chart shows a search rate of about 3x the indexing rate.

How can I get the number of index and search requests in a time window? I thought I could get it by subtracting the min total (index_total or query_total) from the max for a period, but that seems way too high. Also, how do I roll these up to a cluster-level metric?

Hi @kielni, thanks for the feedback; I will pass that on, and I agree.
(Our self-managed offerings, which I am not suggesting you change to, have multiple ways to export the Elasticsearch metrics to the monitoring system of your choice. With respect to the Cloud offering, I think we are still working on it; strangely, it takes a little more effort.)

Now on to your other questions
1st) I would not try to use the metrics charts under Deployments / [deployment] / Performance; tl;dr, they are not as reliable as the metrics in Stack Monitoring. (They are a holdover from the past and were originally meant as a quick glance, etc. Apologies, yes, it is confusing.)

So I will only be comparing to the charts within Stack Monitoring, specifically the totals and nodes, etc. Stack Monitoring is the only capability I would use to inspect performance at this time.

2nd) With respect to your aggregations / queries and math, I think you are close, but you are using some fields incorrectly / misunderstanding them.

"node_stats.indices.indexing.index_time_in_millis"

is not an elapsed time. From here:

"index_time_in_millis: (integer) Total time in milliseconds spent performing indexing operations. "

index_time_in_millis is the time actually spent indexing the documents; it is used to calculate the average time per indexing operation. It is not the elapsed time, and the same goes for query_time_in_millis. So this is not the metric you should divide by to get indexing operations / sec.

So for index / sec or query / sec, you should be using the elapsed wall-clock time.

Here is mine, and these line up with what I see in Stack Monitoring and make sense.

GET .monitoring-es-7-mb-2021.03.02/_search
{
  "aggs": {
    "node": {
      "terms": {
        "field": "source_node.name",
        "size": 5
      },
      "aggs": {
        "from_ts": {
          "min": {
            "field": "timestamp"
          }
        },
        "to_ts": {
          "max": {
            "field": "timestamp"
          }
        },
        "from_index_count": {
          "min": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "to_index_count": {
          "max": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "from_index_time_ms": {
          "min": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "to_index_time_ms": {
          "max": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "from_search_count": {
          "min": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "to_search_count": {
          "max": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "from_search_time": {
          "min": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "to_search_time": {
          "max": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
       "sum_index_time": {
          "sum": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "sum_query_time": {
          "sum": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "heap_used_percent": {
          "avg": {
            "field": "node_stats.jvm.mem.heap_used_percent"
          }
        }
      }
    }
  },
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "node_stats"
          }
        },
        {
          "term": {
            "cluster_uuid": "asasdfasdfsadfasasfdasdf"
          }
        },
        {
          "term": {
            "source_node.name": {
              "value": "instance-0000000073"
            }
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  }
}

# Results

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 30,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "node" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "instance-0000000073",
          "doc_count" : 30,
          "sum_index_time" : {
            "value" : 1.271088426E9
          },
          "from_search_count" : {
            "value" : 1.1329617E7
          },
          "heap_used_percent" : {
            "value" : 58.7
          },
          "from_index_time_ms" : {
            "value" : 4.2366507E7
          },
          "to_ts" : {
            "value" : 1.614657039624E12,
            "value_as_string" : "2021-03-02T03:50:39.624Z"
          },
          "sum_query_time" : {
            "value" : 2.61023127E8
          },
          "to_search_time" : {
            "value" : 8702127.0
          },
          "to_search_count" : {
            "value" : 1.133298E7
          },
          "to_index_count" : {
            "value" : 5.28768085E8
          },
          "from_ts" : {
            "value" : 1.614656749623E12,
            "value_as_string" : "2021-03-02T03:45:49.623Z"
          },
          "to_index_time_ms" : {
            "value" : 4.2372696E7
          },
          "from_index_count" : {
            "value" : 5.28713453E8
          },
          "from_search_time" : {
            "value" : 8699523.0
          }
        }
      ]
    }
  }
}

You need to be a bit careful with the elapsed time; I took the difference in timestamps, though technically these are 10s collection buckets.

Name	Value
From Index Count	528,713,453
To Index Count	528,768,085
Delta Index (Number of Indexing Events)	54,632
Elapsed Time sec (difference in timestamps)	290
Indexing Events / Sec (correct)	188
From Indexing Time ms	42,366,507
To Indexing Time ms	42,372,696
Delta Indexing Time ms	6,189
Avg Indexing Time ms / request (correct)	0.1133

index_total and query_total are monotonically increasing counters, like many other metrics in systems (I/O bytes_in, bytes_out, etc.), and typically they are graphed as a rate. These do have actual time buckets of 10s associated with them, but the timestamps can generally be used as a good proxy. (Not sure if that helps or not.)
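The arithmetic from the table above, as a small sketch (the numbers are the ones in the table; the helper names are mine):

```python
def counter_rate(from_count, to_count, from_ts_ms, to_ts_ms):
    """Rate of a monotonically increasing counter over the window (events / sec)."""
    elapsed_s = (to_ts_ms - from_ts_ms) / 1000
    return (to_count - from_count) / elapsed_s


def avg_ms_per_request(from_count, to_count, from_time_ms, to_time_ms):
    """Average time spent per operation: delta busy time / delta operation count."""
    return (to_time_ms - from_time_ms) / (to_count - from_count)


# Index counts from the table; the two timestamps are 290 s apart
rate = counter_rate(528_713_453, 528_768_085, 0, 290_000)
avg = avg_ms_per_request(528_713_453, 528_768_085, 42_366_507, 42_372_696)
print(round(rate), round(avg, 4))  # 188 0.1133
```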

Hope that helps a bit.

Thanks, that example is really helpful. Now that I have a better understanding of what the fields are, I've got it working with the results of /_nodes/stats, and can drop all the complexity of shipping the metrics to another cluster. I also realized I can just send the data points to Datadog, and it can calculate the rates just like it does for the metrics provided by their Elasticsearch integration.
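For anyone following along, the delta-between-samples approach against /_nodes/stats can be sketched like this. Only the field paths (indices.search.query_total, indices.indexing.index_total) come from the real API response; the helper itself and the sample data are illustrative.

```python
def node_rates(prev, curr, elapsed_s):
    """Per-node search/index rates from two successive /_nodes/stats samples.

    prev/curr are the parsed "nodes" objects from the API response.
    """
    rates = {}
    for node_id, stats in curr.items():
        if node_id not in prev:
            continue  # node joined between samples; no baseline yet
        d_search = (stats["indices"]["search"]["query_total"]
                    - prev[node_id]["indices"]["search"]["query_total"])
        d_index = (stats["indices"]["indexing"]["index_total"]
                   - prev[node_id]["indices"]["indexing"]["index_total"])
        rates[node_id] = {
            "search_per_sec": d_search / elapsed_s,
            "index_per_sec": d_index / elapsed_s,
        }
    return rates


# Two synthetic samples taken 60 s apart
prev = {"n1": {"indices": {"search": {"query_total": 100},
                           "indexing": {"index_total": 1000}}}}
curr = {"n1": {"indices": {"search": {"query_total": 400},
                           "indexing": {"index_total": 1600}}}}
print(node_rates(prev, curr, 60))  # {'n1': {'search_per_sec': 5.0, 'index_per_sec': 10.0}}
```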


Totally Nice!!! Glad we could help.

And someday, perhaps take a look at Elasticsearch as a metrics store :slight_smile: You might find we do a great job, especially if you want to do a lot of tagging and aggregations.
