Hello,
When pushing our data to Elastic (in sequential batches of 100), it takes a lot of time even for a small amount of data (24 hours for 1.5 million documents).
Is this normal?
What do you recommend to speed up the indexing?
Thanks
Welcome!
Just to make sure:
client = AppSearch(self.host, http_auth=(self.user, self.password), request_timeout=6000, max_retries=10, retry_on_timeout=True)

# self.upsert_size is 100, matching the App Search per-request limit
for upsert_place in range(0, len(docs), self.upsert_size):
    responses = client.index_documents(
        engine_name=engine,
        documents=docs[upsert_place : upsert_place + self.upsert_size],
    )
docs is already a list of JSON objects; we pull all of this data from Snowflake, where we have one table per engine.
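Roughly, the read side looks like this (a minimal sketch, assuming snowflake-connector-python; the connection parameters and the PLACES table name are placeholders for our real setup):

import snowflake.connector
from snowflake.connector import DictCursor

# Placeholder credentials; our real connection parameters differ.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_database",
    schema="my_schema",
)

# DictCursor returns each row as a dict (a JSON-like object), so the
# result set can be passed straight to index_documents().
with conn.cursor(DictCursor) as cur:
    cur.execute("SELECT * FROM PLACES")  # one table per engine
    docs = cur.fetchall()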
How long does it take to "just" read the content from Snowflake?
Between 5 and 10 seconds, I would say.
For the whole dataset you mean?
Could you measure that?
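Something like this would separate the two phases (a minimal sketch; fetch_docs_from_snowflake is a placeholder for your actual query code, and client/engine are from your snippet above):

import time

# Placeholder for your actual Snowflake query code.
t0 = time.perf_counter()
docs = fetch_docs_from_snowflake()
t1 = time.perf_counter()
print(f"Snowflake read: {t1 - t0:.1f}s for {len(docs)} docs")

# Time the App Search side on its own, batch by batch.
for i in range(0, len(docs), 100):
    client.index_documents(engine_name=engine, documents=docs[i : i + 100])
t2 = time.perf_counter()
print(f"App Search indexing: {t2 - t1:.1f}s")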
For instance, on my PC it takes 5 to 10 seconds to retrieve 2 million rows from Snowflake as a list of JSON objects (the docs that correspond to one engine). To push them, I split them into batches of 100, and each batch takes 2 to 5 seconds.
So for 2 million rows that's 20,000 batches, i.e. roughly 11 to 28 hours of pushing alone!
Just out of curiosity, could you send the data to Elasticsearch instead of App Search, with batches of 10,000 for example, and tell us how fast it is?
What does a typical document look like?
From what I understood from your code snippet, since you are indexing in batches, you don't need to loop over the list docs and index each element. Based on this example, the documents parameter accepts a list of JSON objects, which is the case for your docs, so I'd try this instead:
responses = client.index_documents(
    engine_name=engine,
    documents=docs,
)
I haven't used App Search before and don't know why the max is 100 docs per batch (why not more?). But as @dadoonet pointed out, you can try sending the data to Elasticsearch and using the bulk helper from the Python client:
from elasticsearch.helpers import bulk
res = bulk(es, docs, index="my-index")
Then test with several batch sizes (100, 1,000, 10,000, etc.) and see how your cluster performs while indexing!
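For example (a quick sketch, assuming an Elasticsearch Python client es and your docs list; my-index is a placeholder):

import time
from elasticsearch.helpers import bulk

# Each pass re-indexes the same docs; fine for a rough throughput test.
for chunk_size in (100, 1000, 10000):
    start = time.perf_counter()
    bulk(es, docs, index="my-index", chunk_size=chunk_size)
    print(f"chunk_size={chunk_size}: {time.perf_counter() - start:.1f}s")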
Hope it helps
We can't index more than 100 documents in a single call with App Search; that's why I'm looping over docs and taking them 100 by 100 (a subset of the list each time).
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.