Hello,
When pushing our data to Elastic (in sequential batches of 100), it takes a lot of time even for a small amount of data (24 hours for 1.5 million documents).
Is this normal?
What do you recommend to speed up the indexing?
Thanks
Welcome!
Just to make sure:
client = AppSearch(self.host, http_auth=(self.user, self.password), request_timeout=6000, max_retries=10, retry_on_timeout=True)

# self.upsert_size is 100, matching the App Search per-request limit
for upsert_place in range(0, len(docs), self.upsert_size):
    responses = client.index_documents(
        engine_name=engine,
        documents=docs[upsert_place : upsert_place + self.upsert_size],
    )
docs is already a list of JSON objects; we pull all of this data from Snowflake, where we have one table per engine.
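Roughly, the read side looks like this (a minimal sketch, assuming snowflake-connector-python; the connection parameters and the PLACES table name are placeholders for our real setup):

import snowflake.connector
from snowflake.connector import DictCursor

# Placeholder credentials; our real connection parameters differ.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_database",
    schema="my_schema",
)

# DictCursor returns each row as a dict (a JSON-like object), so the
# result set can be passed straight to index_documents().
with conn.cursor(DictCursor) as cur:
    cur.execute("SELECT * FROM PLACES")  # one table per engine
    docs = cur.fetchall()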
How long does it take to "just" read the content from Snowflake?
Between 5 and 10 seconds, I would say.
For the whole dataset you mean?
Could you measure that?
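Something like this would separate the two phases (a minimal sketch; fetch_docs_from_snowflake is a placeholder for your actual query code, and client/engine are from your snippet above):

import time

# Placeholder for your actual Snowflake query code.
t0 = time.perf_counter()
docs = fetch_docs_from_snowflake()
t1 = time.perf_counter()
print(f"Snowflake read: {t1 - t0:.1f}s for {len(docs)} docs")

# Time the App Search side on its own, batch by batch.
for i in range(0, len(docs), 100):
    client.index_documents(engine_name=engine, documents=docs[i : i + 100])
t2 = time.perf_counter()
print(f"App Search indexing: {t2 - t1:.1f}s")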
For instance, on my PC it takes 5 to 10 seconds to retrieve 2 million rows from Snowflake as a list of JSON objects (the docs that correspond to one engine). To push them, I split them into batches of 100, and each batch takes 2 to 5 seconds.
So for 2 million rows that's 20,000 batches, i.e. roughly 11 to 28 hours of pushing alone!
Just out of curiosity, could you send the data to Elasticsearch instead of App Search, with batches of 10,000 for example, and tell us how fast it is?
What does a typical document look like?
From what I understood from your code snippet, since you are indexing in batches, you don't need to loop over the list docs and index each element. Based on this example, the documents parameter accepts a list of JSON objects, which is the case for your docs, so I'd try this instead:
responses = client.index_documents(
    engine_name=engine,
    documents=docs,
)
I haven't used App Search before and don't know why the max is 100 docs per batch (why not more?). But as @dadoonet pointed out, you can try sending the data to Elasticsearch and using the bulk helper from the Python client:
from elasticsearch.helpers import bulk
res = bulk(es, docs, index="my-index")
Then test with several batch sizes (100, 1,000, 10,000, etc.) and see how your cluster performs while indexing!
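For example (a quick sketch, assuming an Elasticsearch Python client es and your docs list; my-index is a placeholder):

import time
from elasticsearch.helpers import bulk

# Each pass re-indexes the same docs; fine for a rough throughput test.
for chunk_size in (100, 1000, 10000):
    start = time.perf_counter()
    bulk(es, docs, index="my-index", chunk_size=chunk_size)
    print(f"chunk_size={chunk_size}: {time.perf_counter() - start:.1f}s")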
Hope it helps
We can't index more than 100 documents in a single call with App Search; that's why I'm looping over docs and taking them 100 by 100 (a subset of the list each time).
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.