Is there a way to speed up the average inference time?

Is there a way to speed up the average inference time in ML, as it is currently slow? I’m currently running version 8.13.2. The ML node currently has 32 vCPUs, and I've allocated 32 threads to the model. My goal is to achieve an average response time of around 300ms for the text_similarity infer API when it's not in a cached state. Any advice you could give would be greatly appreciated.

Hi @shwanlee

Thanks for using the infer API for text_similarity. Which model are you using for the task?

When you deploy the model there are two settings: number_of_allocations and threads_per_allocation. To minimise the latency of a single request, use 1 allocation and give it as many threads per allocation as you can.
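For example, on a 32 vCPU node you could start the deployment with a single allocation that uses all the CPUs. Note that threads_per_allocation is fixed once the deployment is started, so it has to be set at start time (the model ID below is a placeholder):

POST _ml/trained_models/&lt;model_id&gt;/deployment/_start?number_of_allocations=1&threads_per_allocation=32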

Please can you share the output of the model stats API?

GET _ml/trained_models/<model_id_or_deployment_id>/_stats

Hi @dkyle
The number of allocations and threads per allocation settings have been configured. The ML Node server is an AWS EC2 instance of type m7i.8xlarge.

How are you measuring response time? Is it the time taken to perform the _ml/trained_models/&lt;model_id&gt;/_infer request?

How many documents are you sending per request? Could you share a sample request, please?
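For reference, one rough way to measure the end-to-end request time is with curl's built-in timer (the host, credentials, and request.json file below are placeholders):

curl -s -o /dev/null -w "%{time_total}\n" -u elastic:&lt;password&gt; \
  -H "Content-Type: application/json" \
  -X POST "https://&lt;host&gt;:9200/_ml/trained_models/&lt;model_id&gt;/_infer" \
  -d @request.json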

We send 10 documents per request, with an average response time of around 1 second, and sometimes exceeding 2 seconds. To operate our service, we need an average response time of less than 300 milliseconds. Is this technically feasible? I’ll create and send a sample request for you to review.

If you have 10 documents per request, it might be faster to use more allocations, as the documents can then be processed in parallel by each allocation. Try 10 allocations with 1 or 2 threads each.
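number_of_allocations can be changed on a running deployment with the update API, so you can experiment without restarting (the deployment ID below is a placeholder):

POST _ml/trained_models/&lt;deployment_id&gt;/deployment/_update
{
  "number_of_allocations": 10
}

threads_per_allocation, however, is fixed at start time, so changing it requires stopping and restarting the deployment.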

I'll give it a try! Here is the sample request.

POST _ml/trained_models/dongjin-kr__ko-reranker/_infer
{
  "docs": [
    {
      "text_field": "Semantic search is a search method that helps you find data based on the intent and contextual meaning of a search query, instead of a match on query terms (lexical search). Elasticsearch provides semantic search capabilities using natural language processing (NLP) and vector search. Deploying an NLP model to Elasticsearch enables it to extract text embeddings out of text. Embeddings are vectors that provide a numeric representation of a text. Pieces of content with similar meaning have similar representations."
    },
    {
      "text_field": "Elastic Learned Sparse EncodeR - or ELSER - is an NLP model trained by Elastic that enables you to perform semantic search by using sparse vector representation. Instead of literal matching on search terms, semantic search retrieves results based on the intent and the contextual meaning of a search query. The instructions in this tutorial shows you how to use ELSER to perform semantic search on your data."
    },
    {
      "text_field": "[Limitation] The following limitations and known problems apply to the 8.13.4 release of the Elastic natural language processing trained models feature. ELSER semantic search is limited to 512 tokens per field that inference is applied to. When you use ELSER for semantic search, only the first 512 extracted tokens from each field of the ingested documents that ELSER is applied to are taken into account for the search process. If your data set contains long documents, divide them into smaller segments before ingestion if you need the full text to be searchable. Only the first 512 extracted tokens per field are considered during semantic search with ELSER. Refer to this page for more information. The minimum dedicated ML node size for deploying and using the ELSER model is 4 GB in Elasticsearch Service if deployment autoscaling is turned off. Turning on autoscaling is recommended because it allows your deployment to dynamically adjust resources based on demand. Better performance can be achieved by using more allocations or more threads per allocation, which requires bigger ML nodes. Autoscaling provides bigger nodes when required. If autoscaling is turned off, you must provide suitably sized nodes yourself. ELSER output must be ingested into a field with the sparse_vector or rank_features field type. Otherwise, Elasticsearch interprets the token-weight pairs as a massive amount of fields in a document. If you get an error similar to this Limit of total fields [1000] has been exceeded while adding new fields then the ELSER output field is not mapped properly and it has a field type different than sparse_vector or rank_features. In this step, you load the data that you later use in the inference ingest pipeline to extract tokens from it. Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. 
All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file. IMPORTANT: The msmarco-passagetest2019-top1000 dataset was not utilized to train the model. It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes. You can use a different data set to test the workflow and become familiar with it. Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI. Assign the name id to the first column and content to the second column. The index name is test-data. Once the upload is complete, you can see an index named test-data with 182469 documents."
    },
    {
      "text_field": "The instructions in this tutorial shows you how to use the inference API with various services to perform semantic search on your data. The following examples use Cohere’s embed-english-v3.0 model, the all-mpnet-base-v2 model from HuggingFace, and OpenAI’s text-embedding-ada-002 second generation embedding model. You can use any Cohere and OpenAI models, they are all supported by the inference API. For a list of supported models available on HuggingFace, refer to the supported model list. Click the name of the service you want to use on any of the widgets below to review the corresponding instructions. The mapping of the destination index - the index that contains the embeddings that the model will create based on your input text - must be created. The destination index must have a field with the dense_vector field type to index the output of the used model."
    },
    {
      "text_field": "A search query, or query, is a request for information about data in Elasticsearch data streams or indices. You can think of a query as a question, written in a way Elasticsearch understands. Depending on your data, you can use a query to get answers to questions like: What processes on my server take longer than 500 milliseconds to respond What users on my network ran regsvr32.exe within the last week What pages on my website contain a specific word or phrase Elasticsearch supports several search methods: Search for exact values Search for exact values or ranges of numbers, dates, IPs, or strings. Full-text search Use full text queries to query unstructured textual data and find documents that best match query terms. Vector search Store vectors in Elasticsearch and use approximate nearest neighbor (ANN) or k-nearest neighbor (kNN) search to find vectors that are similar, supporting use cases like semantic search. Run a searchedit To run a search request, you can use the search API or Search Applications. Search API The search API enables you to search and aggregate data stored in Elasticsearch using a query language called the Query DSL. Search Applications Search Applications enable you to leverage the full power of Elasticsearch and its Query DSL, with a simplified user experience. Create search applications based on your Elasticsearch indices, build queries using search templates, and easily preview your results directly in the Kibana Search UI."
    },
    {
      "text_field": "App Search Enterprise Search guides Enterprise Search App Search Workplace Search Programming language clients Node.js client PHP client Python client Ruby client › Guides « Sanitization Guide Synonyms Guide » Search guide edit Searching is how you read data and generate results from the documents stored within your Engines. The search endpoint will be invoked each time a search is performed by a user. Unlike the other endpoints, which customize Engines, view analytics, tune relevance, or index documents, search is for, well... Searching ! You can use your Public Search Key or your Private API Key to query the search endpoint. The Public Search Key is for performing search using client side JavaScript, within mobile application development, or any other context where you are at risk of exposing your API Key. The Public Search Key key begins with: search- . The responses generated by the search endpoint should contain data from your engines that you want your users to see. That is why it has its own special, public key. For more API authentication options, refer to the Authentication guide. What do I search? How do I search? What about result meta data? Why search? Where to Next? Search tasks Or, instead read the search API reference to get coding. What do I search? edit Search is all about finding documents . A document is an individual JSON object. When you add data into your Engines, you are taking database or backend API objects and indexing them. But what does it mean, to index?"
    },
    {
      "text_field": "Enterprise Search Enterprise Search guides Enterprise Search App Search Workplace Search Programming language clients Node.js client PHP client Python client Ruby client Elastic connectors » What is Elastic Enterprise Search? edit Enterprise Search is an additional Elastic service that adds APIs and UIs to those already provided by Elasticsearch and Kibana. Enterprise Search enables native connectors for Elastic Cloud deployments, the Elastic web crawler, and two standalone products: App Search and Workplace Search. The Enterprise Search documentation covers the features provided by the server, compatible libraries, and the operation of the server. Features and libraries edit Enterprise Search server enables several Elastic features. Compatible clients and libraries are available for working with these features. These documentation sections are most relevant to developers : Connectors Connectors sync data from various databases and content sources to Elasticsearch. In Elastic Cloud, the Enterprise Search service provides native connectors that require no additional infrastructure. Also documented here are connector clients , which enable you to customize connectors and run them within your own infrastructure. Web crawler Enterprise Search enables the Elastic web crawler, which syncs data from web pages to Elasticsearch. App Search and Workplace Search Enterprise Search is required for users of App Search and Workplace Search, two standalone products that use additional APIs, UIs, and abstractions for application search and workplace search use cases. Programming language clients Enterprise Search has its own client libraries that provide Enterprise Search APIs in various programming languages. Search UI Search UI is a high level client for building user interfaces in React and other supported JavaScript frameworks. Server edit The following documentation sections are most relevant to service operators. 
Enterprise Search server Supporting documentation edit Prerequisites Known issues Troubleshooting Help, support, and feedback Release notes Elastic connectors"
    },
    {
      "text_field": "Machine Learning in the Elastic Stack [8.12] › Natural language processing › Overview « Classify text Deploy trained models » Search and compare text edit The Elastic Stack machine learning features can generate embeddings, which you can use to search in unstructured text or compare different pieces of text. Text embedding Text similarity Text embedding edit Text embedding is a task which produces a mathematical representation of text called an embedding. The machine learning model turns the text into an array of numerical values (also known as a vector ). Pieces of content with similar meaning have similar representations. This means it is possible to determine whether different pieces of text are either semantically similar, different, or even opposite by using a mathematical similarity function. This task is responsible for producing only the embedding. When the embedding is created, it can be stored in a dense_vector field and used at search time. For example, you can use these vectors in a k-nearest neighbor (kNN) search to achieve semantic search capabilities."
    },
    {
      "text_field": "Elasticsearch Guide [8.12] › Search your data « Search templates Search Applications search API and templates » Elastic Search Applications edit Search Applications enable users to build search-powered applications that leverage the full power of Elasticsearch and its Query DSL, with a simplified user experience. Create search applications based on your Elasticsearch indices, build queries using search templates, and easily preview your results directly in the Kibana Search UI. You can also interact with your search applications using the Search Application APIs . Search Applications are designed to simplify building unified search experiences across a range of enterprise search use cases, using the Elastic platform. Search Applications documentation Documentation for the Search Applications feature lives in two places: The documentation in this section covers the basics of Search Applications, information about working with Search Applications in the Kibana UI, and use case examples. The Elasticsearch API documentation contains the API references for working with Search Applications programmatically. Jump there if you’re only interested in the APIs. Availability and prerequisites edit The Search Applications feature was introduced in Elastic version 8.8.0 . Search Applications is a beta feature. Beta features are subject to change and are not covered by the support SLA of general release (GA) features."
    },
    {
      "text_field": "Your search applications use search templates to perform searches. Templates help reduce complexity by exposing only template parameters, while using the full power of Elasticsearch’s query DSL to formulate queries. Templates may be set when creating or updating a search application, and can be customized. This template can be edited or updated at any time using the Put Search Application API API call. In a nutshell, you create search templates with parameters instead of specific hardcoded search values. At search time, you pass in the actual values for these parameters, enabling customized searches without rewriting the entire query structure. Search Application templates: Simplify query requests Reduce request size Ensure security and performance, as the query is predefined and can’t be changed arbitrarily This document provides some sample templates to get you started using search applications for additional use cases. These templates are designed to be easily modified to meet your needs. Once you’ve created a search application with a template, you can search your search application using this template. "
    }
  ],
  "inference_config": {
    "text_similarity": {
      "text": "what is semantic search?",
      "tokenization": {
        "xlm_roberta": {
          "truncate": "second"
        }
      }
    }
  }
}

@dkyle The average response time hasn't dropped below 1 second and remains between 1 and 2 seconds, which seems similar to before.

I experimented with various values for threads_per_allocation and number_of_allocations and found that 16 allocations with 2 threads each gave the fastest response, but I could not get the response time down to 300ms.

The ko-reranker is a large model; a smaller model will be faster. BAAI/bge-reranker-base might be a good option for you.

Another option is to use the Cohere rerank model, which is available through the Elasticsearch Inference API. Here's a blog describing how to get started: "Elasticsearch open Inference API adds support for Cohere's Rerank 3 model" (Elastic Search Labs).
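As a sketch, creating a Cohere rerank endpoint through the Inference API looks something like the following. Note this requires a version with rerank support (later than your 8.13.2) and your own Cohere API key; the endpoint name cohere_rerank is arbitrary:

PUT _inference/rerank/cohere_rerank
{
  "service": "cohere",
  "service_settings": {
    "api_key": "&lt;your-api-key&gt;",
    "model_id": "rerank-english-v3.0"
  }
}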

Thank you very much. As you suggested, I increased number_of_allocations and decreased threads_per_allocation. By sending only the title and excluding the body text to reduce the input size, the average response time has decreased significantly.

Lastly, could you explain the criteria for model allocation? In my opinion, at least one allocation should go to the second node, but in reality nothing is assigned to it, and memory is exceeded on the other nodes.

The sizes of the models allocated to each node vary significantly. I would appreciate it if someone could provide an answer.

[ threads_per_allocation 4 / number_of_allocations 1 ]

[ threads_per_allocation 2 / number_of_allocations 7 ]