How to get 95th and 99th percentile response times?

We want to extract performance metrics from Elasticsearch using Telegraf or any Elasticsearch API.
But it seems that neither Telegraf nor the Elasticsearch APIs expose percentile latencies.

They report the following:

"search": {
           "open_contexts": 0,
           "query_total": 123,
           "query_time_in_millis": 531,
           "query_current": 0,
           "fetch_total": 3,
           "fetch_time_in_millis": 55,
           "fetch_current": 0
}

Here query_time_in_millis is a cumulative value since node start. Even if we use Graphite/Grafana functions to compute the rate of this value, that only gives us the total time taken by all queries per minute. It would be really nice if this were reported as a 95th and/or 99th percentile, so that we could state with confidence that 95% of queries completed in less than X milliseconds. (Helps to meet SLAs.)

Dividing query_time_in_millis / query_total does not achieve that; it only gives the overall average latency.
Dividing delta(query_time_in_millis) / delta(query_total) also does not achieve that; it only gives the average over the window.
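A quick numeric sketch (with made-up counter values) of why the delta ratio only recovers the mean, which can look nothing like the tail of the distribution:

```python
# Two consecutive node-stats snapshots (hypothetical numbers):
prev = {"query_total": 100, "query_time_in_millis": 1000}
curr = {"query_total": 200, "query_time_in_millis": 2100}

# delta(time) / delta(count) is the *mean* latency of the window.
mean_ms = (curr["query_time_in_millis"] - prev["query_time_in_millis"]) / (
    curr["query_total"] - prev["query_total"]
)
print(mean_ms)  # -> 11.0

# Contrast: 100 queries where 99 take 1 ms and one takes 1001 ms
# have the same 11 ms mean, yet a completely different distribution.
samples = sorted([1] * 99 + [1001])
p95 = samples[int(0.95 * len(samples)) - 1]  # simple nearest-rank percentile
print(p95)  # -> 1
```

The mean is identical in both cases, so no arithmetic on the cumulative counters can recover the 95th or 99th percentile.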

Plus, some charting tools like Grafana do not support dividing one metric by another.

Is there any way to achieve the 95th/99th percentile of latencies?
It's also not clear what query_current stands for. Is it a rate of requests per minute, per second, per 5 minutes, etc.?

Some extra context: we are using Dropwizard metrics in many of our projects, and it has really good ways of reporting such metrics.

The node stats are more for high-level monitoring of individual nodes, more like snapshots to get a feel for how a node is doing. If you care specifically about query latency, I'd probably instrument that myself in my app (log the query duration). That will include network transfer, query and fetch phase, etc.
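As a sketch of that kind of instrumentation (Python; `search_fn` is a hypothetical stand-in for whatever issues the query in your app):

```python
import time

latencies_ms = []  # raw samples; hand these to your metrics library


def timed_search(search_fn, *args, **kwargs):
    # Timing around the whole call captures network transfer plus the
    # query and fetch phases end to end, which per-node stats cannot.
    start = time.perf_counter()
    try:
        return search_fn(*args, **kwargs)
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
```

From the collected samples you can then compute any percentile you like, or feed them into a histogram/timer in your metrics library of choice.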

If you just go by the query_ stats in node stats, you're only looking at the query phase on each individual node, which doesn't represent the full picture. A good extreme example: nodes with a 10ms 99th percentile latency but a 1s 100th percentile.

If you have 100 shards in a query (extreme but simple math :wink: ) it's likely that looking at a per-node basis will show 10ms 99th percentile. But if you look at the overall query, since there are 100 shards involved at least one will hit the 100th percentile latency and you'll actually be closer to 1s all the time.

That said, I'd probably use the scripting functionality of the percentiles aggregation to divide query_time_in_millis / query_total and use that for the percentiles calculation.
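A rough sketch of that request body (expressed as a Python dict you'd pass to a search call; this assumes the node-stats snapshots are indexed into ES with those counter fields, and the exact script syntax may vary by ES version):

```python
# Percentiles aggregation over a scripted per-document ratio.
# Assumes each document is one node-stats snapshot with the raw
# query_time_in_millis / query_total counters (illustrative field names).
body = {
    "size": 0,
    "aggs": {
        "latency_pcts": {
            "percentiles": {
                "script": {
                    "source": (
                        # Guard against query_total == 0 in fresh nodes.
                        "doc['query_total'].value > 0 ? "
                        "doc['query_time_in_millis'].value / "
                        "(double) doc['query_total'].value : 0"
                    )
                },
                "percents": [95, 99],
            }
        }
    },
}
```

Note this is the percentile of the *cumulative average* per snapshot, not of individual query latencies, which is part of the caveat below.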

The danger is that it's not very flexible, since you are limited to intervals at the logging rate (unless you want to start averaging percentiles, which is dangerous).

The alternative is using a set of pipeline aggs to calculate the percentiles of a window of time (date_histo -> avg -> derivative agg -> bucket_script to divide -> percentiles_bucket). Useful, but also a slightly different style of metric.
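That pipeline chain might look roughly like this (again as a Python dict; index, field, and interval names are illustrative and assume the stats snapshots are indexed with a timestamp):

```python
# date_histogram -> avg of the cumulative counters -> derivative ->
# bucket_script to divide the deltas -> percentiles_bucket over the buckets.
body = {
    "size": 0,
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "@timestamp", "interval": "1m"},
            "aggs": {
                "time_ms": {"avg": {"field": "query_time_in_millis"}},
                "count": {"avg": {"field": "query_total"}},
                # Derivatives turn the cumulative counters into per-bucket deltas.
                "time_deriv": {"derivative": {"buckets_path": "time_ms"}},
                "count_deriv": {"derivative": {"buckets_path": "count"}},
                # Average latency within each bucket.
                "avg_latency": {
                    "bucket_script": {
                        "buckets_path": {"t": "time_deriv", "c": "count_deriv"},
                        "script": "params.t / params.c",
                    }
                },
            },
        },
        # Percentiles across the per-bucket averages.
        "latency_pcts": {
            "percentiles_bucket": {
                "buckets_path": "per_minute>avg_latency",
                "percents": [95, 99],
            }
        },
    },
}
```

As noted, this yields percentiles of per-bucket *averages*, so it is a different metric than a true per-query latency percentile.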

That's the number of queries in flight at the exact point in time the API was called, i.e. the number of queries actively processing right now.

@polyfractal, several other datastores have the 95th/99th percentile kind of latency reporting. Here are a few example links:

  1. cassandra
  2. solr
  3. storm
  4. kafka-broker

Basically, Dropwizard metrics have become very popular in many open-source systems because their metric reporting is intuitive and helpful.
If ES implemented constructs like timers, each node would report its own 95th and 99th percentiles, which is far more useful than looking at query_time_in_millis / query_total. Plus, many charting tools do not support the division operation, so you have to write custom code in the middle (i.e. between ES and the charting tool) just to do a small division; that code is not difficult to write but becomes difficult to maintain operationally. It would also be of great help when connecting through JMX.

@polyfractal, as an Elastic developer, let us know if you see value in the above proposal and we can create a ticket in GitHub/Jira for this. The Dropwizard metrics constructs like timers, counters, meters, etc. are powerful, simple, and in use by several other products.
Disclaimer: I am not related to Dropwizard in any way. Just a happy user of the product.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.