Wait-until-merges-finish-after-index operations

Hello Rally Gurus,

I am testing my ES cluster using ESRally with the openai_vector track and I have a question about the "wait-until-merges-finish-after-index" operation. In the track README file, I see the track parameter "parallel_indexing_time_period (default: 1800)". Does this parameter control the "wait-until-merges-finish-after-index" operation? If it does, is it wise to lower the wait time, since that's 30 minutes of inactivity? Am I way off base? Please explain.

Best,

Tom

Hi Tom,

The default (and only) challenge in the openai_vector track performs parallel indexing and search operations as its primary set of tasks. Before these parallel operations begin, there is a standalone initial indexing operation to pre-load the index, followed by a refresh (to commit any remaining indexing ops to disk), and then the wait-until-merges-finish-after-index operation you mentioned. wait-until-merges-finish-after-index polls the cluster for any remaining in-flight segment merges that could pollute the benchmark; it does not use the parallel_indexing_time_period track parameter. This task should not take long, since it only waits for segment merges to finish.
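To illustrate the polling idea only, here is a minimal Python sketch. It is not Rally's implementation. It assumes the merge count is read from the `_all.total.merges.current` field of an index stats response, and `fetch_stats`, `wait_until_merges_finish`, the poll interval, and the timeout are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict


def current_merges(stats: Dict) -> int:
    """Read the number of in-flight merges from an index-stats-style response."""
    # Assumption: the count lives at _all.total.merges.current.
    return stats["_all"]["total"]["merges"]["current"]


def wait_until_merges_finish(
    fetch_stats: Callable[[], Dict],
    poll_interval: float = 1.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until no merges are in flight; return the number of polls made."""
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        if current_merges(fetch_stats()) == 0:
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("segment merges still running after timeout")
        sleep(poll_interval)


if __name__ == "__main__":
    # Fake stats source standing in for a real cluster: 3, then 1, then 0 merges.
    remaining = iter([3, 1, 0])
    fake = lambda: {"_all": {"total": {"merges": {"current": next(remaining)}}}}
    print(wait_until_merges_finish(fake, sleep=lambda _: None))
```

The loop returns as soon as the reported merge count reaches zero, so its duration depends on how much merging is outstanding rather than on any fixed time period.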

parallel_indexing_time_period controls how long the indexing portion should run in the parallel search & indexing task execution.

Thank you,
Jason

Thank you, Jason, for the quick reply and explanation. Much appreciated!

Best,

Tom

Hi Jason,

Could I ask another question? How come I don't see either cohere_vector or openai_vector benchmarks in this link? Is it because the dataset is too big to run nightly tests?

Best,

Tom

Hi Tom,

There is no particular reason, other than that the developers of the cohere_vector and openai_vector tracks chose not to include them in the nightly regression test benchmarks.

Thanks,
Jason

Thank you, sir!

Tom

Hi Jason,

May I ask another question? I am not exactly sure what the standalone-search-knn-100-1000-multiple-clients operation in the cohere_vector track does. I checked default.json under the operations directory, but it is still not clear to me. Instead of guessing, could you help explain this operation?

Thanks,

Tom

Hi Tom,

I will break it down from the top:

  1. The standalone-search-knn-100-1000-multiple-clients task performs the knn-search-100-1000 operation with some number of search clients (default 8, configured with track parameter standalone_search_clients) and number of iterations (default 10000, configured with track parameter standalone_search_iterations). Each client executes the same number of iterations.

  2. In the knn-search-100-1000 operation:
    a. It is a search operation.
    b. The parameter source is knn-param-source.
    c. Parameter k is set to 100.
    d. Parameter num-candidates is 1000.

What does it mean?

knn-param-source is registered in the track's track.py file from class KnnParamSource. Without getting too much into the specifics of KnnParamSource, parameters k and num-candidates are used to build a search request body in the form of a Knn query, similar to those described on the "Knn query" page of the Elasticsearch Guide [8.13]. Each execution of the search (for the configured number of iterations) uses a query vector from the queries.json file, also found in the track.
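As a rough sketch of that last step (this is not the actual KnnParamSource code): the example below assumes the request uses the top-level `knn` search section, which accepts both `k` and `num_candidates`. The field name `my_vector_field`, the helper names, the tiny in-memory query vectors, and the choice to reuse vectors when they run out are all placeholders for illustration; the body the track really sends may differ.

```python
from itertools import cycle, islice
from typing import Dict, Iterator, List


def build_knn_body(
    query_vector: List[float],
    k: int = 100,
    num_candidates: int = 1000,
    field: str = "my_vector_field",  # placeholder, not the track's field name
) -> Dict:
    """Build a search body with a top-level knn section (assumed shape)."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        }
    }


def search_bodies(
    query_vectors: List[List[float]], iterations: int, **knn_args
) -> Iterator[Dict]:
    """Yield one body per iteration, cycling through the vectors (assumption)."""
    for vector in islice(cycle(query_vectors), iterations):
        yield build_knn_body(vector, **knn_args)


if __name__ == "__main__":
    # Two toy vectors stand in for the contents of queries.json.
    for body in search_bodies([[0.1, 0.2], [0.3, 0.4]], iterations=3):
        print(body)
```

Here k is the number of nearest neighbors returned, and num_candidates is the number of candidates considered per shard before the top k are selected, which is why the sketch rejects a num_candidates smaller than k.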

The Knn query docs referenced above better describe Knn queries, composition, and functionality.

Thank you,
Jason

Thank you, Jason, for the detailed explanation! Very much appreciate it.

Best,

Tom