How to speed up Language Identification Inference

Hi, I was trying the language identification feature (Language identification | Machine Learning in the Elastic Stack [7.12] | Elastic) and noticed that the model inference was very slow.
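
For reference, the pipeline is basically the example from that docs page, something like this (the field names are placeholders, not my exact mapping):

```
PUT _ingest/pipeline/lang_ident
{
  "description": "detect the language of the contents field",
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": {
          "classification": {
            "num_top_classes": 3
          }
        },
        "field_map": {
          "contents": "text"
        },
        "target_field": "_ml.lang_ident"
      }
    }
  ]
}
```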

Originally, my Elasticsearch 7.12.1 cluster could index at least 3000 documents per second.

But after applying the pipeline, the indexing rate dropped to 100 documents per second.

Does anyone know how to speed it up?
I have tried adding more nodes and more CPUs, but that didn't seem to help.

Hey @rueian , there will always be some performance impact when adding an inference processor to your pipeline.

But we do our best to make it fast. The model should be fully contained in the cache on the ingest node.

  • How much JVM heap is allocated to the ingest nodes in your cluster?
  • How many inference processors are located in your pipeline?
  • Is the pipeline doing anything else more complicated?

You may want to look at GET _nodes/stats to see the ingest throughput stats and confirm where the bottleneck is as well.
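
For example, the ingest-only view of the node stats returns count, time_in_millis, and failed per pipeline and per processor:

```
GET _nodes/stats/ingest
```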

Also, take a look at your settings here: Machine learning settings in Elasticsearch | Elasticsearch Guide [7.12] | Elastic

If there is a large gap in the times that batches of documents arrive, you may want to increase xpack.ml.inference_model.time_to_live to match that so that the model is never evicted from cache.
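
If I remember right, that one is a static node setting, so it goes in elasticsearch.yml on the ingest nodes and needs a restart. The 30m value below is just an example; match it to the gap between your batches:

```
xpack.ml.inference_model.time_to_live: 30m
```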

Hi @BenTrent,

I allocated 2GB of JVM heap on every node.
I was also not sure whether 2GB was enough for the lang_ident_model_1 model. I didn't find any recommendation about it in the documentation.

The only inference processor was the lang_ident_model_1 processor.
And according to the time_in_millis metric from GET _nodes/stats/ingest, the processor was indeed the bottleneck: it took 95% of the pipeline's time.
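
Roughly, I compared the inference processor's time_in_millis against the pipeline total, with something like this (the pipeline name is a placeholder):

```
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines.my-lang-pipeline
```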

@rueian , I am wondering if the inference time includes the time it took to initially load the model and put it into cache.

Once you have one pipeline running and have sent some docs through, could you create another pipeline (this one will use the model already cached by the previous pipeline) and see what the throughput is there?
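
Something roughly like this, as a sketch (the pipeline and field names are placeholders):

```
# a second pipeline that reuses the same, already-cached model
PUT _ingest/pipeline/lang_ident_warm
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "field_map": { "contents": "text" },
        "target_field": "_ml.lang_ident"
      }
    }
  ]
}

# quick smoke test before pointing real indexing traffic at it
POST _ingest/pipeline/lang_ident_warm/_simulate
{
  "docs": [
    { "_source": { "contents": "Das Leben ist kein Ponyhof" } }
  ]
}
```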

Again, there will always be some slowdown, I just want to get a sense of how much there is :slight_smile:

Hi @BenTrent ,

I just tested with a bigger cluster. Each node had 8 cores and a 14GB JVM heap, and the cluster finally achieved indexing 3000 documents per second.

I also noticed that the JVM heap usage kept fluctuating dramatically, even though the only job the cluster did was indexing those documents with the pipeline. Is that usage expected? Or is it a sign that the model keeps being moved in and out of the cache?

@rueian it could be dropping in and out of cache, but it should only do that if its TTL was reached or it was kicked out of cache by another model (since this is your only model, I don't think the latter is occurring).
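
One quick way to sanity-check that (if I recall the 7.12 APIs correctly) is the trained model stats; the inference_stats section has a cache_miss_count that should stay essentially flat once the model is warm:

```
GET _ml/trained_models/lang_ident_model_1/_stats
```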

The JVM heap utilization doesn't seem to have a cadence of 5m. It very well could be the inference action itself: encoding the text, pushing it through the model, and then decoding the result. That creates some short-lived double[] values that could increase JVM heap usage, and then young GC cleans them out quickly without blocking.

The way to verify is to look at the cluster logs and see if there are messages from ModelLoadingService indicating as much (fully qualified class name: org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService).

To see which docs are hitting the cache or not, you could turn on TRACE for that particular class, but that will write multiple log messages PER document, so it would not only create a ton of log messages but could also impact performance.
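
Something along these lines via the dynamic logger settings (and remember to set it back to null when you are done):

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService": "TRACE"
  }
}
```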
