Exporting Elastic cloud cluster performance stats

stephenb · March 2, 2021, 3:57am

Hi @kielni, thanks for the feedback. I will pass that on; I agree.
(Our self-managed offerings, which I am not suggesting you change to, have multiple ways to export the Elasticsearch metrics to the monitoring system of your choice. With respect to the Cloud offering, I think we are still working on it; strangely, it takes a little more effort.)

Now on to your other questions.
1st) I would not try to use the metrics charts under Deployments / [deployment] / Performance. tl;dr: they are not as reliable as the metrics in Stack Monitoring. (They are a holdover from the past and were originally meant as a quick glance, etc. Apologies, yes, it is confusing.)

So I will only be comparing to the charts within Stack Monitoring, specifically to the totals, nodes, etc. Stack Monitoring is the only capability I would use to inspect performance at this time.

2nd) With respect to your aggregations / queries and math, I think you are close, but you are using some incorrect fields and misreading what they measure.

"node_stats.indices.indexing.index_time_in_millis"

is not an elapsed time. From the docs:

"index_time_in_millis: (integer) Total time in milliseconds spent performing indexing operations. "

index_time_in_millis is the time actually spent indexing the documents. It is used to calculate the average time per indexing operation; it is not the elapsed time. The same goes for query_time_in_millis, etc. So this is not the metric you should divide by to get indexing operations / sec (a per-time value).

So for index / sec or query / sec you should be dividing by the elapsed time.

Here is mine; the results line up with what I see in Stack Monitoring and make sense.

GET .monitoring-es-7-mb-2021.03.02/_search
{
  "aggs": {
    "node": {
      "terms": {
        "field": "source_node.name",
        "size": 5
      },
      "aggs": {
        "from_ts": {
          "min": {
            "field": "timestamp"
          }
        },
        "to_ts": {
          "max": {
            "field": "timestamp"
          }
        },
        "from_index_count": {
          "min": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "to_index_count": {
          "max": {
            "field": "node_stats.indices.indexing.index_total"
          }
        },
        "from_index_time_ms": {
          "min": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "to_index_time_ms": {
          "max": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "from_search_count": {
          "min": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "to_search_count": {
          "max": {
            "field": "node_stats.indices.search.query_total"
          }
        },
        "from_search_time": {
          "min": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "to_search_time": {
          "max": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "sum_index_time": {
          "sum": {
            "field": "node_stats.indices.indexing.index_time_in_millis"
          }
        },
        "sum_query_time": {
          "sum": {
            "field": "node_stats.indices.search.query_time_in_millis"
          }
        },
        "heap_used_percent": {
          "avg": {
            "field": "node_stats.jvm.mem.heap_used_percent"
          }
        }
      }
    }
  },
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "node_stats"
          }
        },
        {
          "term": {
            "cluster_uuid": "asasdfasdfsadfasasfdasdf"
          }
        },
        {
          "term": {
            "source_node.name": {
              "value": "instance-0000000073"
            }
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  }
}

# Results

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 30,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "node" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "instance-0000000073",
          "doc_count" : 30,
          "sum_index_time" : {
            "value" : 1.271088426E9
          },
          "from_search_count" : {
            "value" : 1.1329617E7
          },
          "heap_used_percent" : {
            "value" : 58.7
          },
          "from_index_time_ms" : {
            "value" : 4.2366507E7
          },
          "to_ts" : {
            "value" : 1.614657039624E12,
            "value_as_string" : "2021-03-02T03:50:39.624Z"
          },
          "sum_query_time" : {
            "value" : 2.61023127E8
          },
          "to_search_time" : {
            "value" : 8702127.0
          },
          "to_search_count" : {
            "value" : 1.133298E7
          },
          "to_index_count" : {
            "value" : 5.28768085E8
          },
          "from_ts" : {
            "value" : 1.614656749623E12,
            "value_as_string" : "2021-03-02T03:45:49.623Z"
          },
          "to_index_time_ms" : {
            "value" : 4.2372696E7
          },
          "from_index_count" : {
            "value" : 5.28713453E8
          },
          "from_search_time" : {
            "value" : 8699523.0
          }
        }
      ]
    }
  }
}

For the elapsed time you need to be a bit careful, but I took the difference in timestamps; technically these are 10s collection buckets.
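As a rough sketch of that step, here is the elapsed-time calculation in Python. The two constants are the from_ts and to_ts values from the results above; the function and variable names are just mine for illustration.

```python
# from_ts / to_ts values copied from the aggregation results above (epoch milliseconds).
FROM_TS_MS = 1_614_656_749_623  # 2021-03-02T03:45:49.623Z
TO_TS_MS = 1_614_657_039_624    # 2021-03-02T03:50:39.624Z


def elapsed_seconds(from_ms: int, to_ms: int) -> float:
    """Elapsed wall-clock seconds between two epoch-millisecond timestamps."""
    if to_ms <= from_ms:
        raise ValueError("to_ms must be later than from_ms")
    return (to_ms - from_ms) / 1000.0


elapsed = elapsed_seconds(FROM_TS_MS, TO_TS_MS)
print(round(elapsed))  # 290
```

Note that this is a little under the 5 minutes (300 s) requested by the now-5m range filter, because the first and last samples do not sit exactly on the window edges.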

Name	Value
From Index Count	528,713,453
To Index Count	528,768,085
Delta Index (Number of Indexing Events)	54,632
Elapsed Time sec (difference in timestamps)	290
Indexing Events / Sec (correct)	188


Name	Value
From Index Count	528,713,453
To Index Count	528,768,085
Delta Index (Number of Indexing Events)	54,632
From Indexing Time ms	42,366,507
To Indexing Time ms	42,372,696
Delta Indexing Time ms	6,189
Avg Indexing Time ms / request (correct)	0.1133
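The arithmetic in the two tables can be sketched in Python like this. The input numbers are the min / max values from the results above; rate_and_avg is a hypothetical helper name, not anything from Elasticsearch. The same math applies to the search counters, so I included those too.

```python
def rate_and_avg(from_count, to_count, from_time_ms, to_time_ms, elapsed_s):
    """Return (operations per second, average ms per operation) for one window.

    The rate divides by elapsed wall-clock time; the average divides the time
    spent (the *_time_in_millis delta) by the number of operations.
    """
    delta_ops = to_count - from_count
    delta_ms = to_time_ms - from_time_ms
    ops_per_sec = delta_ops / elapsed_s
    avg_ms = delta_ms / delta_ops if delta_ops else 0.0
    return ops_per_sec, avg_ms


ELAPSED_S = 290  # difference between to_ts and from_ts, rounded as in the table

# Indexing: index_total and index_time_in_millis (min / max over the window).
index_rate, index_avg_ms = rate_and_avg(
    528_713_453, 528_768_085, 42_366_507, 42_372_696, ELAPSED_S
)

# Search: query_total and query_time_in_millis (min / max over the window).
search_rate, search_avg_ms = rate_and_avg(
    11_329_617, 11_332_980, 8_699_523, 8_702_127, ELAPSED_S
)

print(f"{index_rate:.0f} indexing events / sec")   # 188 indexing events / sec
print(f"{index_avg_ms:.4f} ms / indexing request")  # 0.1133 ms / indexing request
print(f"{search_rate:.1f} queries / sec")
print(f"{search_avg_ms:.4f} ms / query")
```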

The index_total and query_total fields are monotonically increasing counters, like many other metrics in systems (I/O bytes_in, bytes_out, etc.), and they are typically graphed as a rate. They do have actual time buckets of 10s associated with them, but the timestamps can generally be used as a good proxy. (Not sure whether that helps or not.)
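If you want a rate series rather than one number for the whole window, a minimal sketch of turning counter samples into per-interval rates could look like the following. The sample values are made up for illustration (they are not from my cluster), counter_rates is a hypothetical helper, and I am assuming a counter that goes backwards means the node restarted and the counter reset.

```python
def counter_rates(samples):
    """Convert (epoch_seconds, counter_value) samples into per-interval rates.

    samples must be sorted by time. Intervals where time does not advance or
    where the counter decreases (assumed to be a reset) are skipped.
    Returns a list of (epoch_seconds, rate_per_second) tuples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            continue  # out-of-order sample or counter reset
        rates.append((t1, (c1 - c0) / dt))
    return rates


# Made-up samples at a 10 s collection interval, for illustration only.
samples = [(0, 1000), (10, 2880), (20, 4760), (30, 120), (40, 2000)]
rates = counter_rates(samples)
print(rates)  # [(10, 188.0), (20, 188.0), (40, 188.0)]
```

The interval ending at t=30 is dropped because the counter went down; the next interval is then computed from the post-reset value.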

Hope that helps a bit.
