When using Elasticsearch's Inference API and Inference Processor to vectorize and write text (using Alibaba's text embedding model, PR: [Inference API] Add Alibaba Cloud AI Search Model support to Inference API #111181), we set the bulk batch size to 100. However, this immediately triggered a rate-limit error from the Alibaba text_embedding interface. We found that this interface is rate-limited to 50 QPS, but it supports passing multiple documents in a single request as an array (with a maximum batch size of 32), so its effective processing capacity is 50 * 32 = 1600 documents per second.
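
For context, here is a minimal sketch of roughly what our setup looks like. The index, pipeline, and endpoint names are illustrative, and the Alibaba Cloud AI Search inference endpoint (`alibaba-embeddings`) is assumed to already exist:

```python
import requests

ES = "http://localhost:9200"    # illustrative cluster address
AUTH = ("elastic", "password")  # illustrative credentials

# Ingest pipeline with an inference processor that vectorizes the `content`
# field through an existing Alibaba Cloud AI Search text_embedding endpoint.
requests.put(
    f"{ES}/_ingest/pipeline/embed-content",
    json={
        "processors": [
            {
                "inference": {
                    "model_id": "alibaba-embeddings",
                    "input_output": {
                        "input_field": "content",
                        "output_field": "content_vector",
                    },
                }
            }
        ]
    },
    auth=AUTH,
)

# Bulk request with 100 documents routed through the pipeline; each document
# triggers its own embedding call, which immediately trips the 50 QPS limit.
bulk_body = ""
for i in range(100):
    bulk_body += '{"index": {"_index": "docs"}}\n{"content": "document text %d"}\n' % i

requests.post(
    f"{ES}/_bulk",
    params={"pipeline": "embed-content"},
    data=bulk_body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=AUTH,
)
```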
However, because we write through the Inference Processor, Elasticsearch issues a separate inference request for each document, which caps the write rate at 50 documents per second.
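
By contrast, the _inference API itself accepts an array of inputs, so a single call can carry a batch of up to 32 documents; at 50 QPS that is where the 50 * 32 = 1600 docs/sec figure comes from. A sketch continuing from the one above (the response parsing reflects my understanding of the text_embedding output and may need adjusting):

```python
# Single inference call carrying a batch of 32 inputs.
texts = ["document text %d" % i for i in range(32)]
resp = requests.post(
    f"{ES}/_inference/text_embedding/alibaba-embeddings",
    json={"input": texts},
    auth=AUTH,
)
# Assumed response shape: {"text_embedding": [{"embedding": [...]}, ...]}
vectors = [item["embedding"] for item in resp.json()["text_embedding"]]
```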
I want to know whether there is a good way to address this, for example by enabling the Inference Processor to batch documents into a single inference call instead of making an individual inference API call per document.
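
For illustration, the only interim workaround I can see (and not the one we would prefer) is to batch on the client side: call the _inference API in chunks of 32, attach the vectors ourselves, and bulk-index without the pipeline. A rough sketch under the same assumptions as above:

```python
import itertools
import json

import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "password")

def chunked(seq, size):
    """Yield successive chunks of at most `size` items from `seq`."""
    it = iter(seq)
    while batch := list(itertools.islice(it, size)):
        yield batch

docs = [{"content": "document text %d" % i} for i in range(100)]

bulk_body = ""
for batch in chunked(docs, 32):  # respect the 32-document batch limit
    resp = requests.post(
        f"{ES}/_inference/text_embedding/alibaba-embeddings",
        json={"input": [d["content"] for d in batch]},
        auth=AUTH,
    )
    # Assumed response shape, as in the previous sketch.
    vectors = [item["embedding"] for item in resp.json()["text_embedding"]]
    for doc, vec in zip(batch, vectors):
        doc["content_vector"] = vec
        bulk_body += '{"index": {"_index": "docs"}}\n' + json.dumps(doc) + "\n"

# Index the pre-embedded documents without any ingest pipeline.
requests.post(
    f"{ES}/_bulk",
    data=bulk_body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=AUTH,
)
```

This reduces 100 documents to about 4 inference requests instead of 100, but it gives up the convenience of the ingest pipeline, which is why native batching in the Inference Processor would be much preferable.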