How Does Azure OpenAI Inference with the Semantic Text Field Handle Rate Limiting and Chunking for High Token Volume?

I’m using Azure OpenAI's embedding model in an inference pipeline, where documents are indexed with the semantic_text field in an Elastic Cloud setup.

I would like to understand how rate limiting is handled in such scenarios:

  1. Rate Limiting Handling:
    If my inference setup sends more than 1 million tokens per minute or exceeds 1440 requests per minute, how can I ensure that the Azure OpenAI endpoint doesn’t start throttling or dropping requests? Is there a built-in retry mechanism or should I handle this externally?
  2. Semantic Text Field and Chunking:
    When using the semantic_text field in an Elasticsearch pipeline, and a document exceeds the allowed token limits per request, will the semantic_text field's built-in chunking logic respect the rate limits of the Azure OpenAI model? Or do I need to manage chunking and pacing at the client or middleware level?

Any guidance or real-world implementation experience around setting up resilient inference pipelines with rate limits and large document sizes is appreciated.

Hey @Ateet_Agarwal, welcome to the community!

The only automatic retries are those that already exist in the _bulk API. Our recommendation is to ensure that you are sending traffic that is compatible with your token and traffic limits.
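To make the "handle it externally" part concrete, here is a minimal sketch of client-side pacing and retry around _bulk, assuming you know your Azure OpenAI quota. The URL, budget, and helper name are placeholders I made up, not anything built into Elasticsearch:

import time
import requests

ES_URL = "https://localhost:9200"       # placeholder cluster URL
MAX_REQUESTS_PER_MINUTE = 1440          # match your Azure OpenAI quota
MIN_INTERVAL = 60.0 / MAX_REQUESTS_PER_MINUTE

def bulk_with_retry(ndjson_body, max_retries=5):
    """Send one _bulk request, backing off on 429/5xx responses."""
    for attempt in range(max_retries):
        resp = requests.post(
            f"{ES_URL}/_bulk",
            data=ndjson_body,
            headers={"Content-Type": "application/x-ndjson"},
        )
        if resp.status_code == 429 or resp.status_code >= 500:
            # Honor Retry-After when the server sends it, otherwise
            # back off exponentially (1s, 2s, 4s, ...).
            time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        resp.raise_for_status()
        time.sleep(MIN_INTERVAL)  # crude pacing toward the per-minute budget
        return resp.json()
    raise RuntimeError("_bulk request still failing after retries")

Note that inference failures usually surface per item inside a 200 bulk response rather than as a top-level status code, so you still need to inspect the response body (see the snippet further down). Separately, if I remember the Azure OpenAI service settings correctly, the inference endpoint accepts an optional rate_limit.requests_per_minute service setting that you can lower to keep Elasticsearch itself under your Azure quota.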

For chunking, you can configure your own chunking settings (available in Serverless, and coming to a stack release soon). More information can be found in this blog post. If you want to customize chunking on the client side, you would use the none chunking strategy to send in your own chunks, as sketched below.
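For reference, here is a sketch of what creating such an endpoint might look like, based on the settings described in that blog post. The endpoint name and Azure credentials are placeholders, and the exact chunking_settings keys should be checked against the docs for your version:

import requests

ES_URL = "https://localhost:9200"  # placeholder cluster URL

# Create an Azure OpenAI embedding endpoint that does no server-side
# chunking; the client then sends its own pre-chunked input.
resp = requests.put(
    f"{ES_URL}/_inference/text_embedding/embedding-openai-model",
    json={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<azure-api-key>",         # placeholder
            "resource_name": "<azure-resource>",  # placeholder
            "deployment_id": "<deployment-id>",   # placeholder
            "api_version": "2024-02-01",
        },
        "chunking_settings": {"strategy": "none"},
    },
)
resp.raise_for_status()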

Hope that helps!

Hi @Kathleen_DeRusso - Thanks for your prompt response. I am trying to leverage the benefits of semantic_text fields with their GA in 8.18.
So, does that mean that if chunking fails or a rate limit is hit because of a lengthy document, the document will be reported as status=failed?

Correct, you should receive an appropriate error response and it will be reported by the bulk API call.
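To catch those per-item failures, check the errors flag and walk the items array of the bulk response; a minimal sketch, assuming resp is the parsed JSON body of a _bulk call:

def failed_items(resp):
    """Yield (doc_id, status, error_type) for each failed bulk item."""
    if not resp.get("errors"):
        return
    for item in resp["items"]:
        # Each item is keyed by its operation: index, create, or update.
        op = next(iter(item.values()))
        if "error" in op:
            yield op.get("_id"), op.get("status"), op["error"].get("type")

Documents whose inference failed (for example, a 429 from Azure OpenAI) show up here even when the overall HTTP status of the bulk call is 200.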


@Kathleen_DeRusso
Today I tried to verify the _bulk update API with two cases:

  1. Prepared a request body more than 50 MB in size and tried to ingest it into an index that has a semantic_text field. I got the error below:
"error": {
    "type": "inference_exception",    // Error when running machine learning inference
    "reason": "Exception when running inference id [embedding-openai-model] on field [body_embedding]",
    "caused_by": {
      "type": "status_exception",
      "reason": "Failed to send Azure OpenAI embeddings request. Cause: Maximum limit of [52428800] bytes reached",
      "caused_by": {
        "type": "input_stream_too_large_exception",
        "reason": "Maximum limit of [52428800] bytes reached"
      }
    }
  }
  2. Reduced the embedding model token limit from 1M tokens to 1K tokens and got this error:
"error": {
                    "type": "inference_exception",
                    "reason": "inference_exception: Exception when running inference id [embedding-openai-model] on field [body_embedding]",
                    "caused_by": {
                        "type": "status_exception",
                        "reason": "Received a rate limit status code. Remaining tokens [unknown]. Remaining requests [unknown]. for request from inference entity id [embedding-openai-model] status [429]. Error message: [Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-02-01 have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 60 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. For Free Account customers, upgrade to Pay as you Go here: https://aka.ms/429TrialUpgrade.]"
                    }
                }
            }

I understand the 429 rate limit issue mentioned in the second case.

But for the 500 error I need some more insight:

  • Is this issue on the Elastic side? I mean, is it unable to send the request to the model through the inference endpoint beyond a certain amount of body content?
  • I am able to successfully execute the first request on a text field, but when it comes to a semantic_text field it throws the 500 error as stated. I would like to know more details about this.
  • How can it be fixed?

Hi @Ateet_Agarwal can we please use Discuss and not Slack to continue this conversation as it is happening in both places? Discuss is not transient so it's probably preferable.

It looks like this was a transient issue per our discussion, but I encourage you to file a bug report if you see this again, ideally with a stack trace obtained by appending ?error_trace=true to the request. Thank you!

@Kathleen_DeRusso
Sure. I wanted to let you know that this is replicable. When I tried with ?error_trace=true, below is the error.

"status": 500,
"error": {
  "type": "inference_exception",
  "reason": "inference_exception: Exception when running inference id [embedding-openai-model] on field [body_embedding]",
  "caused_by": {
      "type": "status_exception",
      "reason": "Failed to send Azure OpenAI embeddings request. Cause: Maximum limit of [52428800] bytes reached",
      "caused_by": {
          "type": "i_o_exception",
          "reason": "Maximum limit of [52428800] bytes reached",
          "suppressed": [
              {
                  "type": "retry_exception",
                  "reason": "retry_exception: org.elasticsearch.ElasticsearchStatusException: Received a server busy error status code for request from inference entity id [embedding-openai-model] status [503]. Error message: [The service is temporarily unable to process your request. Please try again later.]",
                  "caused_by": {
                      "type": "status_exception",
                      "reason": "Received a server busy error status code for request from inference entity id [embedding-openai-model] status [503]. Error message: [The service is temporarily unable to process your request. Please try again later.]"
                  }
              }
          ]
      }
  }
}

I just need to know on what basis I should create the _bulk request body.

  • Should it be based on the number of words?
  • Should it be based on the request body size? In my case, as of now, it is below 1 MB.

Thanks for reproducing this.

This looks like a passthrough error from OpenAI. That also makes sense as it is transient. If that's the case, smaller chunks may help, but it's hard to tell, especially without a stack trace returned (I don't see that in the response you posted).
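On the "what basis" question above: the limit in the error is expressed in bytes (52428800 bytes = 50 MiB), so capping each _bulk body by its serialized size rather than by word count seems the safer heuristic. A sketch, with an arbitrary placeholder budget well under the limit:

def batched_bulk_bodies(actions, max_bytes=5 * 1024 * 1024):
    """Group (action_line, source_line) byte-string pairs into NDJSON
    bodies no larger than roughly max_bytes each."""
    body, size = [], 0
    for action, source in actions:
        pair_size = len(action) + len(source) + 2  # + trailing newlines
        if body and size + pair_size > max_bytes:
            yield b"\n".join(body) + b"\n"
            body, size = [], 0
        body.extend((action, source))
        size += pair_size
    if body:
        yield b"\n".join(body) + b"\n"

Keep in mind this only bounds the bulk body itself; a single very large semantic_text value still produces a large inference request, which is where server-side or client-side chunking comes in.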

I think at this point this is definitely a bug, especially as it is returning a 500. I would encourage you to open a bug report in the Elasticsearch GitHub repo, contact your support representative, or both! The other thing that we may be able to do with a support ticket is get a request/response so engineering can try to duplicate this issue.

Thank you!