Hi @kielni, thanks for the feedback. I agree, and I will pass it on.
(Our self-managed offerings, which I am not suggesting you switch to, have multiple ways to export the Elasticsearch metrics to the monitoring system of your choice. With respect to the Cloud offering, I think we are still working on it; strangely, it takes a little more effort.)
Now on to your other questions.
1st) I would not try to use the metrics charts under Deployments / [deployment] / Performance. tl;dr: they are not as reliable as the metrics in Stack Monitoring. (They are a holdover from the past and were originally meant as a quick glance. Apologies, yes, it is confusing.)
So I will only compare against the charts within Stack Monitoring, specifically the totals, nodes, etc. Stack Monitoring is the only capability I would use to inspect performance at this time.
2nd) With respect to your aggregations / queries and math, I think you are close, but you are using some incorrect fields / assumptions.
"node_stats.indices.indexing.index_time_in_millis"
is not an elapsed time. From the docs for index_time_in_millis:
"index_time_in_millis: (integer) Total time in milliseconds spent performing indexing operations."
It is the time actually spent indexing the documents, and it is used to calculate the average time per indexing operation. It is not the elapsed (wall-clock) time, and the same goes for the query time fields. So this is not the metric you should divide by to get indexing operations / sec.
For index / sec or query / sec, you should instead divide the change in the counter by the elapsed time.
Here is my query. The results line up with what I see in Stack Monitoring and make sense.
GET .monitoring-es-7-mb-2021.03.02/_search
{
  "aggs": {
    "node": {
      "terms": {
        "field": "source_node.name",
        "size": 5
      },
      "aggs": {
        "from_ts": { "min": { "field": "timestamp" } },
        "to_ts": { "max": { "field": "timestamp" } },
        "from_index_count": { "min": { "field": "node_stats.indices.indexing.index_total" } },
        "to_index_count": { "max": { "field": "node_stats.indices.indexing.index_total" } },
        "from_index_time_ms": { "min": { "field": "node_stats.indices.indexing.index_time_in_millis" } },
        "to_index_time_ms": { "max": { "field": "node_stats.indices.indexing.index_time_in_millis" } },
        "from_search_count": { "min": { "field": "node_stats.indices.search.query_total" } },
        "to_search_count": { "max": { "field": "node_stats.indices.search.query_total" } },
        "from_search_time": { "min": { "field": "node_stats.indices.search.query_time_in_millis" } },
        "to_search_time": { "max": { "field": "node_stats.indices.search.query_time_in_millis" } },
        "sum_index_time": { "sum": { "field": "node_stats.indices.indexing.index_time_in_millis" } },
        "sum_query_time": { "sum": { "field": "node_stats.indices.search.query_time_in_millis" } },
        "heap_used_percent": { "avg": { "field": "node_stats.jvm.mem.heap_used_percent" } }
      }
    }
  },
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "node_stats" } },
        { "term": { "cluster_uuid": "asasdfasdfsadfasasfdasdf" } },
        { "term": { "source_node.name": { "value": "instance-0000000073" } } },
        { "range": { "timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}
# Results
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 30,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "node" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "instance-0000000073",
          "doc_count" : 30,
          "sum_index_time" : {
            "value" : 1.271088426E9
          },
          "from_search_count" : {
            "value" : 1.1329617E7
          },
          "heap_used_percent" : {
            "value" : 58.7
          },
          "from_index_time_ms" : {
            "value" : 4.2366507E7
          },
          "to_ts" : {
            "value" : 1.614657039624E12,
            "value_as_string" : "2021-03-02T03:50:39.624Z"
          },
          "sum_query_time" : {
            "value" : 2.61023127E8
          },
          "to_search_time" : {
            "value" : 8702127.0
          },
          "to_search_count" : {
            "value" : 1.133298E7
          },
          "to_index_count" : {
            "value" : 5.28768085E8
          },
          "from_ts" : {
            "value" : 1.614656749623E12,
            "value_as_string" : "2021-03-02T03:45:49.623Z"
          },
          "to_index_time_ms" : {
            "value" : 4.2372696E7
          },
          "from_index_count" : {
            "value" : 5.28713453E8
          },
          "from_search_time" : {
            "value" : 8699523.0
          }
        }
      ]
    }
  }
}
You need to be a bit careful with the elapsed time. I took the difference between the min and max timestamps; technically these are 10s collection buckets.
Name | Value |
---|---|
From Index Count | 528,713,453 |
To Index Count | 528,768,085 |
Delta Index (Number of Indexing Events) | 54,632 |
Elapsed Time sec (difference in timestamps) | 290 |
Indexing Events / Sec (correct) | 188 |

Name | Value |
---|---|
From Index Count | 528,713,453 |
To Index Count | 528,768,085 |
Delta Index (Number of Indexing Events) | 54,632 |
From Indexing Time ms | 42,366,507 |
To Indexing Time ms | 42,372,696 |
Delta Indexing Time ms | 6,189 |
Avg Indexing Time ms / request (correct) | 0.1133 |
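
If it helps, here is a minimal Python sketch of the same arithmetic, using the values copied from the bucket in the response above. The function name `rate_and_latency` and the dictionary layout are just my own illustration, not anything from Elasticsearch. The search numbers are not in the table; they are simply the same calculation applied to the `*_search_*` values from the same response.

```python
# Values copied from the aggregation bucket in the response above.
bucket = {
    "from_ts": 1614656749623,        # epoch millis (2021-03-02T03:45:49.623Z)
    "to_ts": 1614657039624,          # epoch millis (2021-03-02T03:50:39.624Z)
    "from_index_count": 528_713_453,
    "to_index_count": 528_768_085,
    "from_index_time_ms": 42_366_507,
    "to_index_time_ms": 42_372_696,
    "from_search_count": 11_329_617,
    "to_search_count": 11_332_980,
    "from_search_time": 8_699_523,
    "to_search_time": 8_702_127,
}


def rate_and_latency(count_from, count_to, busy_ms_from, busy_ms_to,
                     ts_from_ms, ts_to_ms):
    """Return (operations per second, average ms per operation).

    The rate divides by wall-clock elapsed time (timestamp difference);
    the average divides the time spent working by the number of operations.
    """
    delta_ops = count_to - count_from
    delta_busy_ms = busy_ms_to - busy_ms_from
    elapsed_s = (ts_to_ms - ts_from_ms) / 1000.0
    ops_per_sec = delta_ops / elapsed_s if elapsed_s > 0 else 0.0
    avg_ms_per_op = delta_busy_ms / delta_ops if delta_ops > 0 else 0.0
    return ops_per_sec, avg_ms_per_op


index_rate, index_avg_ms = rate_and_latency(
    bucket["from_index_count"], bucket["to_index_count"],
    bucket["from_index_time_ms"], bucket["to_index_time_ms"],
    bucket["from_ts"], bucket["to_ts"],
)
search_rate, search_avg_ms = rate_and_latency(
    bucket["from_search_count"], bucket["to_search_count"],
    bucket["from_search_time"], bucket["to_search_time"],
    bucket["from_ts"], bucket["to_ts"],
)

print(f"indexing: {index_rate:.1f} ops/sec, {index_avg_ms:.4f} ms/op")
print(f"search:   {search_rate:.1f} ops/sec, {search_avg_ms:.4f} ms/op")
```

The indexing figures should match the table (about 188 events / sec and about 0.1133 ms per request). Dividing the 54,632 events by the 6,189 ms of indexing time instead would give a number that looks like a rate but is not one.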
The index_total and query_total fields are monotonically increasing counters, like many other system metrics (I/O bytes_in, bytes_out, etc.), and they are typically graphed as a rate. They do have actual 10s time buckets associated with them, but the timestamps can generally be used as a good proxy. (Not sure whether that helps or not.)
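
To illustrate what "graphed as a rate" means for a counter like this, here is a rough Python sketch that turns a series of (timestamp, counter) samples into per-second rates. The function name `counter_to_rates` and the sample values are made up for illustration. Skipping an interval where the counter goes down is my assumption about how to handle a counter that restarts from a lower value (for example after a node restart); it is not a description of what Stack Monitoring does internally.

```python
def counter_to_rates(samples):
    """Convert counter samples into per-second rates.

    samples: list of (timestamp_ms, counter_value), sorted by timestamp.
    Returns a list of (timestamp_ms, rate_per_sec), one per usable interval.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed_s = (t1 - t0) / 1000.0
        if elapsed_s <= 0:
            continue  # duplicate or out-of-order sample
        delta = c1 - c0
        if delta < 0:
            continue  # counter went backwards (assumed reset): skip interval
        rates.append((t1, delta / elapsed_s))
    return rates


# Hypothetical samples at 10s intervals, with a reset before the fourth one.
samples = [
    (0, 1000),
    (10_000, 1500),
    (20_000, 2100),
    (30_000, 50),
    (40_000, 250),
]

for ts, rate in counter_to_rates(samples):
    print(ts, rate)
```

With 10s samples, each rate is simply the counter delta divided by 10, which is why the timestamp difference works as a proxy for the collection bucket.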
Hope that helps a bit.