Speeding up indexing

Hello,

When pushing our data to Elastic (in sequential batches of 100), indexing takes a long time even for a small amount of data: 24 hours for 1.5 million documents.
Is that normal?

What do you recommend to speed up indexing?

Thanks

Welcome!

Just to make sure:

  • Are you sending the data to App Search or Elasticsearch?
  • How are you sending the data? Which API are you calling?
  • What is the source of the data? How are you fetching the data from the source?
  • Where are you running Elastic? On Elastic Cloud? On a local machine?
  • What hardware do you have?
  • App Search (so batches of 100 max)
  • I'm using Astronomer (a scheduler) that creates a pod which launches a Python script. The client is elastic-enterprise-search (8.4).
  • The script is the following:
client = AppSearch(self.host, http_auth=(self.user, self.password),
                   request_timeout=6000, max_retries=10, retry_on_timeout=True)
# self.upsert_size is 100, the App Search per-request maximum
for upsert_place in range(0, len(docs), self.upsert_size):
    responses = client.index_documents(
        engine_name=engine,
        documents=docs[upsert_place : upsert_place + self.upsert_size],
    )
  • docs is already a list of JSON objects; we pull all of this data from Snowflake, where we have one table per engine (see the sketch after this list)
  • Indexing via Astronomer takes at least 2-3 seconds per batch of 100; from my PC it's around 5 seconds per batch
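
Roughly, the Snowflake read looks like this (a minimal sketch assuming snowflake-connector-python; the connection parameters and table name are placeholders, not our real ones):

from snowflake.connector import DictCursor, connect

# Hypothetical connection parameters; replace with your own.
conn = connect(account="my_account", user="my_user", password="my_password",
               warehouse="my_wh", database="my_db", schema="my_schema")

# DictCursor yields each row as a dict, i.e. a JSON-ready document.
with conn.cursor(DictCursor) as cur:
    cur.execute("SELECT * FROM my_engine_table")  # one table per engine
    docs = cur.fetchall()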

How long does it take to "just" read the content from Snowflake?

Between 5 and 10 seconds, I would say.

For the whole dataset, you mean?

Could you measure that?
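
For instance, something like this (a minimal sketch using only Python's standard library; fetch_docs_from_snowflake is a placeholder for your existing Snowflake read):

import time

start = time.perf_counter()
docs = fetch_docs_from_snowflake()  # hypothetical: your existing Snowflake query
print(f"fetched {len(docs)} docs in {time.perf_counter() - start:.1f}s")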

For instance, from my PC it takes 5-10 seconds to retrieve 2 million rows from Snowflake as a list of JSON objects (the docs that correspond to one engine). To push them, I split them into batches of 100, and each batch takes 2-5 seconds.
So for 2 million rows, it takes days!

Just out of curiosity, could you send the data to Elasticsearch instead of App Search, with batches of 10,000 for example, and tell us how fast it is?

What does a typical document look like?

From what I understand from your code snippet, since you are indexing in batches, you don't need to loop over the list docs and index each slice. Based on this example, the documents parameter accepts a list of JSON objects, which is exactly what docs is, so I'd try this instead:

responses = client.index_documents(
    engine_name=engine,
    documents=docs,
)

I haven't used App Search before, so I don't know why the max is 100 docs per batch (why not more?). But as @dadoonet suggested, you can try sending the data to Elasticsearch and use the bulk helper.

from elasticsearch import helpers
res = helpers.bulk(es, docs, index="my-index")

Then test several batch sizes (100, 1,000, 10,000, etc.) and see how your cluster performs while indexing!
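
If it helps, a sketch of such a comparison (assuming the elasticsearch Python client; the cluster URL and index name are placeholders):

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumption: adjust to your cluster

for batch_size in (100, 1_000, 10_000):
    start = time.perf_counter()
    for i in range(0, len(docs), batch_size):
        helpers.bulk(es, docs[i : i + batch_size], index="my-index")
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(docs) / elapsed:.0f} docs/sec")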

Hope it helps :slight_smile:


We can't index more than 100 documents at a time with App Search; that's why I'm looping over docs and taking them 100 by 100, i.e. a subset of the list each time.
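
For reference, the same loop with per-document error checking added (a sketch; the engine name is a placeholder, and I'm assuming the App Search documents API's usual response shape of one result object per document, each with an errors list):

BATCH = 100  # App Search's per-request maximum

for i in range(0, len(docs), BATCH):
    responses = client.index_documents(engine_name="my-engine",
                                       documents=docs[i : i + BATCH])
    failed = [r for r in responses if r.get("errors")]  # results with errors
    if failed:
        print(f"batch starting at {i}: {len(failed)} documents failed")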


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.