[esrally] What causes the difference between latency and service time? What is the proper way to use a custom parameter source?

Hello. I just ran into something strange while doing a single-shard test.

Here are my metrics. The error rate is 0.01% and the service time looks reasonable, but latency is really bad.

My questions are:

  1. What causes the difference between latency and service time?

  2. If latency is much higher than service time, does it make sense that the error rate goes up?

  3. To build the query body, my "track.py" reads a CSV file containing 1,000,000 words. Could that be a potential problem?

  4. If (3) is the problem, how can I simulate real-world query parameters?


1. Command and metrics

cmd: esrally race --pipeline=benchmark-only --target-host=192.168.35.177:9200 --track-path=~/rallyCustomTrack/custom-track-search --track-params="t_qps:500,t_period:300" --report-file=~/{path}/report_500.json


------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |         Task |            Value |   Unit |
|---------------------------------------------------------------:|-------------:|-----------------:|-------:|
|                     Cumulative indexing time of primary shards |              |      4.8841      |    min |
|             Min cumulative indexing time across primary shards |              |      0           |    min |
|          Median cumulative indexing time across primary shards |              |      2.44205     |    min |
|             Max cumulative indexing time across primary shards |              |      4.8841      |    min |
|            Cumulative indexing throttle time of primary shards |              |      0           |    min |
|    Min cumulative indexing throttle time across primary shards |              |      0           |    min |
| Median cumulative indexing throttle time across primary shards |              |      0           |    min |
|    Max cumulative indexing throttle time across primary shards |              |      0           |    min |
|                        Cumulative merge time of primary shards |              |      0.910017    |    min |
|                       Cumulative merge count of primary shards |              |      7           |        |
|                Min cumulative merge time across primary shards |              |      0           |    min |
|             Median cumulative merge time across primary shards |              |      0.455008    |    min |
|                Max cumulative merge time across primary shards |              |      0.910017    |    min |
|               Cumulative merge throttle time of primary shards |              |      0.0854833   |    min |
|       Min cumulative merge throttle time across primary shards |              |      0           |    min |
|    Median cumulative merge throttle time across primary shards |              |      0.0427417   |    min |
|       Max cumulative merge throttle time across primary shards |              |      0.0854833   |    min |
|                      Cumulative refresh time of primary shards |              |      0.224667    |    min |
|                     Cumulative refresh count of primary shards |              |     23           |        |
|              Min cumulative refresh time across primary shards |              |      0           |    min |
|           Median cumulative refresh time across primary shards |              |      0.112333    |    min |
|              Max cumulative refresh time across primary shards |              |      0.224667    |    min |
|                        Cumulative flush time of primary shards |              |      0.000716667 |    min |
|                       Cumulative flush count of primary shards |              |      2           |        |
|                Min cumulative flush time across primary shards |              |      0           |    min |
|             Median cumulative flush time across primary shards |              |      0.000358333 |    min |
|                Max cumulative flush time across primary shards |              |      0.000716667 |    min |
|                                        Total Young Gen GC time |              |      0.966       |      s |
|                                       Total Young Gen GC count |              |    263           |        |
|                                          Total Old Gen GC time |              |      0           |      s |
|                                         Total Old Gen GC count |              |      0           |        |
|                                                     Store size |              |      0.799961    |     GB |
|                                                  Translog size |              |      1.02445e-07 |     GB |
|                                         Heap used for segments |              |      0.0501366   |     MB |
|                                       Heap used for doc values |              |      0.00166702  |     MB |
|                                            Heap used for terms |              |      0.03302     |     MB |
|                                            Heap used for norms |              |      0.00415039  |     MB |
|                                           Heap used for points |              |      0           |     MB |
|                                    Heap used for stored fields |              |      0.0112991   |     MB |
|                                                  Segment count |              |     23           |        |
|                                    Total Ingest Pipeline count |              |      0           |        |
|                                     Total Ingest Pipeline time |              |      0           |      s |
|                                   Total Ingest Pipeline failed |              |      0           |        |
|                                                 Min Throughput | query_string |    177.7         |  ops/s |
|                                                Mean Throughput | query_string |    213.13        |  ops/s |
|                                              Median Throughput | query_string |    215.15        |  ops/s |
|                                                 Max Throughput | query_string |    218.91        |  ops/s |
|                                        50th percentile latency | query_string |  83098.6         |     ms |
|                                        90th percentile latency | query_string | 153420           |     ms |
|                                        99th percentile latency | query_string | 169491           |     ms |
|                                      99.9th percentile latency | query_string | 171335           |     ms |
|                                     99.99th percentile latency | query_string | 171642           |     ms |
|                                       100th percentile latency | query_string | 171732           |     ms |
|                                   50th percentile service time | query_string |     20.4667      |     ms |
|                                   90th percentile service time | query_string |     31.5086      |     ms |
|                                   99th percentile service time | query_string |     49.6949      |     ms |
|                                 99.9th percentile service time | query_string |    150.478       |     ms |
|                                99.99th percentile service time | query_string |    179.699       |     ms |
|                                  100th percentile service time | query_string |    664.647       |     ms |
|                                                     error rate | query_string |      0           |      % |
2. track.json

{
    "version": 2,
    "description": "Custom track example: search with test params",
    "indices": [
      {
        "name": "text-index"
      }
    ],
    "schedule": [
      {
        "operation": {
          "name": "query_string",
          "operation-type": "search",
          "param-source": "my-custom-term-param-source"
        },
        "clients": 20,
        "target-throughput": {{ t_qps | default(1000) }},
        "time-period": {{ t_period | default(60) }}
      }
    ]
}
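For context, the `{{ ... }}` placeholders are Jinja2 expressions: Rally renders the track with the values passed via `--track-params`, and the `default(...)` filter supplies a fallback when a parameter is omitted. A minimal sketch of that substitution, using the jinja2 package directly (the snippet string is illustrative, not a Rally API):

```python
from jinja2 import Template

# How --track-params="t_qps:500,t_period:300" fills the placeholders in
# track.json; parameters that are not passed fall back to default(...).
snippet = '{{ t_qps | default(1000) }} ops/s for {{ t_period | default(60) }} s'
print(Template(snippet).render())                         # -> 1000 ops/s for 60 s
print(Template(snippet).render(t_qps=500, t_period=300))  # -> 500 ops/s for 300 s
```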
3. track.py

import random
import csv

class TermParamSource:
    def __init__(self, track, params, **kwargs):
        # Choose a suitable index: if there is only one defined for this track,
        # use it, but let the user override index and type.
        if len(track.indices) == 1:
            default_index = track.indices[0].name
            if len(track.indices[0].types) == 1:
                default_type = track.indices[0].types[0].name
            else:
                default_type = None
        else:
            default_index = "text-index"
            default_type = "_doc"

        # We can eagerly resolve these parameters in the constructor...
        self._index_name = params.get("index", default_index)
        self._type_name = params.get("type", default_type)
        self._cache = params.get("cache", False)
        # ... but the query terms are drawn lazily on each call to params().
        self._params = params
        # Determines whether this parameter source will be "exhausted" at some
        # point or whether Rally can draw values from it infinitely.
        self.infinite = True
        # Load the word list once up front so the per-request path stays cheap.
        self.words = self.init_word()
        
    def init_word(self):
        # Read the entire word list (a CSV with ~1,000,000 words) into memory once.
        word_list = []
        with open('/Users/gimdohyeon/rallyCustomTrack/custom-track-search/paramWord.csv',
                  'r', encoding='utf-8') as f:
            reader = csv.reader(f, delimiter=",", quotechar='"',
                                quoting=csv.QUOTE_ALL)
            for row in reader:
                word_list.extend(row)
        return word_list
    

    def partition(self, partition_index, total_partitions):
        return self
    

    def buildParam(self, length):
        # Draw `length` random words and join them into a single query string.
        return " ".join(random.choice(self.words) for _ in range(length))

    def params(self):
        # You must provide all parameters that the runner expects.
        return {
            "body": {
                "query": {
                    "query_string": {
                        "query": self.buildParam(4)
                    }
                }
            },
            "index": "text-index",
            "type": "_doc",
            "cache": False
        }
   
     
def register(registry):
    registry.register_param_source("my-custom-term-param-source", TermParamSource)

Hi @dan_kim,

Here are my metrics. The error rate is 0.01% and the service time looks reasonable, but latency is really bad.

You have at least one request that errored out (i.e. a non-2xx HTTP response from Elasticsearch). You should be able to find the error status code (4xx or 5xx) in the logs in ~/.rally/logs/rally.log, though you may need to raise the elasticsearch logger to DEBUG in ~/.rally/logging.json to see the full error from Elasticsearch.
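For example, a minimal sketch of that change (it assumes the stock ~/.rally/logging.json layout, i.e. a top-level "loggers" map with an "elasticsearch" entry; you can equally just edit the file by hand):

```python
import json
import os

# Flip Rally's "elasticsearch" logger to DEBUG so full 4xx/5xx response
# bodies show up in ~/.rally/logs/rally.log. The key layout below assumes
# the default logging.json that Rally generates (Python dictConfig format).
path = os.path.expanduser("~/.rally/logging.json")
with open(path, "r", encoding="utf-8") as f:
    config = json.load(f)

config["loggers"]["elasticsearch"]["level"] = "DEBUG"

with open(path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```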

  1. If latency is much higher than service time, does it make sense that the error rate goes up?
  2. To build the query body, my "track.py" reads a CSV file containing 1,000,000 words. Could that be a potential problem?

Perhaps, but obviously it depends on what the actual error is. The fact that latency > service time indicates that you likely have a bottleneck on the client side (Rally).

latency == service_time + processing_time
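For intuition, here is a back-of-the-envelope model (a simplification, not Rally's exact accounting) using the numbers from your report: requests are scheduled at the target 500 ops/s but complete at only ~215 ops/s, so most of each request's latency is time spent waiting to be sent:

```python
# Back-of-the-envelope: latency explodes when the target throughput
# (500 ops/s) exceeds the achieved throughput (~215 ops/s median reported).
target_qps = 500      # --track-params "t_qps:500"
achieved_qps = 215    # ~median throughput from the report above
period_s = 300        # --track-params "t_period:300"

n = achieved_qps * period_s        # requests completed in the period: 64,500
scheduled_at = n / target_qps      # the last one was *scheduled* at t ~ 129 s
completed_at = n / achieved_qps    # ... but *completed* at t = 300 s
print(f"predicted max latency = {(completed_at - scheduled_at) * 1000:,.0f} ms")
# -> ~171,000 ms, right where the reported 100th percentile latency
#    (171,732 ms) sits, while service time stays at ~20-30 ms per request.
```

The same model puts a request finishing around t = 150 s at roughly 85,000 ms of latency, close to the reported 50th percentile latency of 83,098.6 ms.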

I see you're using a Custom Parameter Source, which could be the issue here. We even warn that a param source sits on what we call the "critical" code path, so the param source code itself could be the source of the increased latency versus service time; it could also be other things like CPU throttling, network queuing, etc.

Your parameter source is on a performance-critical code-path. Double-check with Rally's profiling support that you did not introduce any bottlenecks.
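If you want to confirm this, Rally can profile the load generator itself (the --enable-driver-profiling flag in recent Rally versions). And if the param source does turn out to be hot, the usual fix is to move work out of params(), which runs once per request. A minimal sketch of that idea (the class below is illustrative, not your exact track code; the inline word list stands in for the CSV):

```python
import random

class PrebuiltTermParamSource:
    """Illustrative variant: pay the randomness cost up front, not per request."""

    def __init__(self, track, params, **kwargs):
        self.infinite = True
        self.words = ["alpha", "bravo", "charlie"]  # stand-in for the CSV word list
        # Pre-generate a pool of query strings once; params() then only cycles
        # through the pool instead of sampling and joining words per request.
        self._pool = [" ".join(random.choices(self.words, k=4)) for _ in range(10_000)]
        self._i = 0

    def partition(self, partition_index, total_partitions):
        return self

    def params(self):
        self._i = (self._i + 1) % len(self._pool)
        return {
            "body": {"query": {"query_string": {"query": self._pool[self._i]}}},
            "index": "text-index",
            "cache": False,
        }
```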


Thanks for your answer. I followed your instructions, and I would be really grateful if you could help me a little more.

  1. Checking the error
    I changed the level in Rally's logging.json from INFO to DEBUG.
    There are some "-not-actor" log lines with a 400 response; could you let me know what they mean?
2023-03-09 03:58:41,346 -not-actor-/PID:8545 elasticsearch WARNING GET http://192.168.35.177:9200/text-index/_doc/_search?request_cache=false [status:400 request:0.010s]
  2. Running without the custom param source
    I also ran the race without param-source, and it still only sends around 200 req/s.
    Does that mean my laptop (which runs Rally) is the problem? Or the network?
  1. Checking the error
    I changed the level in Rally's logging.json from INFO to DEBUG.
    There are some "-not-actor" log lines with a 400 response; could you let me know what they mean?
2023-03-09 03:58:41,346 -not-actor-/PID:8545 elasticsearch WARNING GET http://192.

A 400 status from Elasticsearch means it's a client error. If you've set the elasticsearch logger to DEBUG, it should give you the full error message; usually a 4xx means the query was malformed.
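If you want to see the full error body without rerunning the whole race, you can replay the request from that log line by hand. A quick sketch using the requests package (the URL is taken from your log line; the body mirrors what the param source generates, with placeholder words):

```python
import requests

# Replay the failing request from rally.log to capture the full 400 response.
# The URL is copied from the log line above; the body mimics the custom
# param source's output (four random words in a query_string query).
resp = requests.get(
    "http://192.168.35.177:9200/text-index/_doc/_search",
    params={"request_cache": "false"},
    json={"query": {"query_string": {"query": "alpha bravo charlie delta"}}},
)
print(resp.status_code)
print(resp.text)  # Elasticsearch explains the reason for a 400 in the body
```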

  2. Running without the custom param source
    I also ran the race without param-source, and it still only sends around 200 req/s.
    Does that mean my laptop (which runs Rally) is the problem? Or the network?

I'd strongly encourage you not to run benchmarks from your laptop if you want to see consistent and accurate results. We have a great talk about the many pitfalls you may run into with benchmarking, called 'Seven Deadly Sins of Benchmarking', that I'd suggest you watch.

In addition to the great insights from @Bradley_Deam, I wanted to highlight this part of the Rally docs: Frequently Asked Questions (FAQ) - Rally 2.10.0 documentation

where you will find a good explanation of latency and service_time.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.