I have an Elasticsearch index containing 1,600,000 relatively large
documents, and I need to scan the index to synchronize it with a classic
SQL database.
My documents include the SQL ID and a timestamp.
To synchronize the SQL db and the Elastic index, I simply read rows
and documents sequentially, both sorted by ID. Comparing the IDs, I can
determine whether I need to delete the document (comparison is negative)
or add a new document from the SQL row (comparison is positive); if the
comparison is 0, I compare the timestamps to know whether I need to
update the document.
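The merge-style comparison described above could be sketched like this (a minimal sketch; the helper name `sync` and the `(id, timestamp)` tuple shape are assumptions, not my actual code):

```python
def sync(sql_rows, es_docs):
    """Walk two id-sorted streams and decide what to do with each pair.

    sql_rows: iterable of (id, timestamp) from the SQL database
    es_docs:  iterable of (id, timestamp) from the Elasticsearch index
    Returns lists of ids to add, delete, and update.
    """
    to_add, to_delete, to_update = [], [], []
    rows, docs = iter(sql_rows), iter(es_docs)
    row, doc = next(rows, None), next(docs, None)
    while row is not None or doc is not None:
        if doc is None or (row is not None and row[0] < doc[0]):
            # Row exists only in SQL: index a new document.
            to_add.append(row[0])
            row = next(rows, None)
        elif row is None or doc[0] < row[0]:
            # Document no longer has a matching SQL row: delete it.
            to_delete.append(doc[0])
            doc = next(docs, None)
        else:
            # Same id on both sides: compare timestamps.
            if row[1] > doc[1]:
                to_update.append(row[0])
            row, doc = next(rows, None), next(docs, None)
    return to_add, to_delete, to_update
```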
It works, but I observe that reading the documents gets a lot slower as I
advance.
I retrieve my documents in chunks by repeating searches on the index,
shifting the "from" field of the request each time, something like this:
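The request body for each chunk would look roughly like this (a sketch: the sort field name `idannonce` is taken from later in the thread, and the chunk size of 10,000 is inferred from the answer below):

```python
CHUNK = 10000  # assumed page size

def make_request(offset):
    """Build the search body for one chunk, shifting "from" each time."""
    return {
        "query": {"match_all": {}},
        "sort": [{"idannonce": "asc"}],  # field name is an assumption
        "from": offset,                  # 0, then 10000, then 20000, ...
        "size": CHUNK,
    }
```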
This simple query is a lot slower when "from" is 1000000 than when it
is 0.
Yes. This is a problem with sorting in a distributed environment. I
presume you have 5 primary shards. When you ask for docs 1,000,000 to
1,009,999, Elasticsearch has to retrieve the first 1,010,000 docs from
EACH shard, sort them, then return the correct 10,000 docs, discarding
the other 5,040,000 of them...
You can understand why it gets slower.
The preferred way to pull lots of docs from ES is to use
search_type=scan, but that can't be combined with sorting.
One alternative is to break your queries into chunks with a range query,
eg all docs created in Jan 2010, then Feb 2010 etc
That makes sense.
I thought Elastic could optimise the query and directly identify the docs
it needs (it's a basic match_all query without any search criteria); it
could simply return the 10,000 docs with idannonce above 1,000,000.
In my case I have a single shard, but Elastic doesn't have to take this
into account.
It looks like I was wrong; I'm going to try with a range query.
Thanks
But because it's a simple match_all, there's no real search criteria in
it, I had the feeling that it could be possible to take the 10,000 docs
directly after the "from" offset.