Bulk indexing with the Python client times out even for a small number of documents

I have a Platinum ELK deployment on Azure with Kibana, and I have set up 2 ML inference pipelines with ELSER for 2 different indices.

I then started indexing into both indices through the ML inference pipelines. Indexing is slow, and it sometimes times out even for 10 docs.

I am only indexing text and keyword fields. One index has 14 fields and the other has only 4 fields.

I have to index more than 100k docs, but only 20k docs were indexed in 2 whole days. Are there any best practices I should follow for faster bulk indexing?

Here is the Python snippet for reference:

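# Build the bulk payload: one action line followed by one document per row.
splitted_facts_list = []
for res in results:
    splitted_facts_list.append({"index": {"_index": SPLITTED_FACTS}})
    splitted_facts_list.append(
        {
            "field1": res[0],
            "field2": res[1],
            "field3": res[1],
            "field4": res[2],
            "field5": res[3],
            "_extract_binary_content": True,
            "_reduce_whitespace": True,
            "_run_ml_inference": True,
        }
    )

# Send every accumulated operation in a single bulk request through the inference pipeline.
es_client.bulk(operations=splitted_facts_list, pipeline=SPLITTED_FACTS_PIPELINE)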

Hi

Inference is slower than pure indexing, but 20K docs in 2 days sounds slow.

When you deploy the ELSER model, set the number of allocations equal to the number of CPU cores you have on your ML node. This will maximise throughput for your hardware. If you have maxed out the number of allocations that will fit on your ML node, add another ML node or scale the node up. You can always scale down when the indexing is complete.
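As a rough sketch of the allocation advice above, the snippet below raises the allocation count of an already-running deployment from Python. It assumes an 8.x `elasticsearch` client recent enough to provide `ml.update_trained_model_deployment`; the function name `set_elser_allocations` and the argument names are made up for illustration, and the model ID you pass must be whatever your deployment actually uses.

```python
def set_elser_allocations(es_client, model_id, ml_node_cpu_cores):
    """Set number_of_allocations of a running deployment to the ML node's CPU core count.

    Hypothetical helper: es_client is assumed to be an elasticsearch.Elasticsearch
    8.x instance, model_id the ID of the deployed ELSER model.
    """
    if ml_node_cpu_cores < 1:
        raise ValueError("ml_node_cpu_cores must be at least 1")
    # One allocation per CPU core, as suggested above.
    return es_client.ml.update_trained_model_deployment(
        model_id=model_id,
        number_of_allocations=ml_node_cpu_cores,
    )
```

For example, on an ML node with 8 cores you might call `set_elser_allocations(es_client, ".elser_model_1", 8)`, assuming `.elser_model_1` is the model ID in your cluster. The same setting is also available in Kibana when starting the deployment.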

If the timeout is occurring in the Python client while it waits for the bulk operation to complete, you have a couple of options:

Try sending 50 or 100 items per bulk request. I prefer a small bulk size because you can monitor progress and get more frequent updates by watching the indexed document count go up. A bulk request containing thousands of documents may take minutes to process, whereas 50 documents might take seconds or tens of seconds.
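A minimal sketch of the small-batch approach, reusing the `bulk(operations=..., pipeline=...)` call from the question. The helper names `chunked` and `bulk_in_small_batches` are made up for illustration. The `request_timeout` passed through `es_client.options(...)` is an extra assumption on my part, not something stated above, and the 120-second default is only a placeholder.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_in_small_batches(es_client, index, pipeline, docs,
                          batch_size=50, request_timeout=120):
    """Index `docs` in small bulk requests and report progress after each one.

    es_client is assumed to be an elasticsearch.Elasticsearch 8.x instance.
    """
    # Assumption: allow each bulk call more time than the client default.
    client = es_client.options(request_timeout=request_timeout)
    indexed = 0
    for batch in chunked(docs, batch_size):
        operations = []
        for doc in batch:
            operations.append({"index": {"_index": index}})  # action line
            operations.append(doc)                           # document source
        response = client.bulk(operations=operations, pipeline=pipeline)
        if response["errors"]:
            # Some items failed; the per-item details are in response["items"].
            print(f"bulk request reported errors after {indexed} docs")
        indexed += len(batch)
        print(f"sent {indexed} docs so far")
    return indexed
```

With the names from the question, this would be called as `bulk_in_small_batches(es_client, SPLITTED_FACTS, SPLITTED_FACTS_PIPELINE, docs)`, where `docs` is the list of document dicts without the action lines.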
