Search time consistency

Hello,

I have a simple question.

I have an Elasticsearch index containing 1,600,000 relatively large
documents, and I need to scan the index to synchronize it with a classic
SQL database.

My documents include the SQL ID and a timestamp.

To synchronize the SQL database and the Elasticsearch index, I simply read
rows and documents sequentially, both sorted by ID. Comparing the IDs tells
me whether I need to delete the document (comparison is negative), add a
new document from the SQL row (comparison is positive), or, if the
comparison is 0, compare the timestamps to know whether I need to update
the document.
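
For readers following along, the comparison walk described above could be sketched roughly as follows. This is only an illustration under assumptions: rows and documents are reduced to hypothetical (id, timestamp) tuples, both inputs are already sorted by ID, and "update" is triggered by any timestamp difference.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, int]  # hypothetical shape: (id, timestamp)


def diff_sorted(sql_rows: Iterable[Row], es_docs: Iterable[Row]) -> Iterator[Tuple[str, int]]:
    """Walk two ID-sorted streams in lockstep and yield (action, id) pairs."""
    sql_iter, es_iter = iter(sql_rows), iter(es_docs)
    row, doc = next(sql_iter, None), next(es_iter, None)
    while row is not None or doc is not None:
        if row is None or (doc is not None and doc[0] < row[0]):
            # Document ID has no SQL counterpart: remove it from the index.
            yield ("delete", doc[0])
            doc = next(es_iter, None)
        elif doc is None or row[0] < doc[0]:
            # SQL row has no document yet: index it.
            yield ("add", row[0])
            row = next(sql_iter, None)
        else:
            # Same ID on both sides: reindex only if the timestamps differ (assumption).
            if row[1] != doc[1]:
                yield ("update", row[0])
            row, doc = next(sql_iter, None), next(es_iter, None)


sql = [(1, 10), (2, 20), (4, 40)]
es = [(1, 10), (2, 15), (3, 30)]
actions = list(diff_sorted(sql, es))
print(actions)
```

Because both sides are consumed as iterators, neither the table nor the index has to fit in memory, which is why the sorted read order matters here.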

It works, but I observe that reading the documents gets a lot slower as I
advance.

I retrieve my documents in chunks by repeating searches on the index,
shifting the "from" field of the request each time, something like this:

{
"from" : 0, "size" : 10000,
"fields" : ["idannonce","ts"],
"sort" : ["idannonce"],
"query" : { "match_all" : {} }
}

This simple query is a lot slower when "from" is 1000000 than when it is 0.

Is this normal behaviour? I thought it should take approximately the
same time, since the "idannonce" field is indexed, no?

Any thoughts? Is there a way to write the same query so that it runs in
constant time?

Thanks

PS: I was hoping to use the Scroll API, but sadly it doesn't support sorting,
so I'm using the Search API to sort my documents by idannonce.

Hiya

Yes. This is a problem with sorting in a distributed environment. I
presume you have 5 primary shards. When you ask for docs 1,000,000 to
1,009,999, Elasticsearch has to retrieve the first 1,010,000 docs from
EACH shard, sort them, then return the correct 10,000 docs, discarding
5,040,000 of them...

You can understand why it gets slower :)
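
To make the arithmetic above concrete, here is a small sketch of the count. It follows the simplified model in this reply (every shard hands back its first from + size entries); the function name is made up for illustration.

```python
def docs_collected(shards: int, offset: int, size: int):
    """Return (entries gathered across shards, entries thrown away)."""
    per_shard = offset + size      # each shard must supply its top from+size entries
    total = shards * per_shard     # what has to be merged and sorted
    return total, total - size     # only `size` docs are actually returned


total, discarded = docs_collected(shards=5, offset=1_000_000, size=10_000)
print(total, discarded)

# Even with a single shard the work still grows with the offset.
single_total, single_discarded = docs_collected(shards=1, offset=1_000_000, size=10_000)
print(single_total, single_discarded)
```

The point is that the cost depends on from + size, not on size alone, so later pages get steadily more expensive.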

The preferred way to pull lots of docs from ES is to use
search_type=scan, but that can't be combined with sorting.

One alternative is to break your queries into chunks with a range query,
e.g. all docs created in Jan 2010, then Feb 2010, etc.
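
A rough sketch of the range idea, adapted to the sort key from the question rather than to creation dates: each request keeps "from" at 0 and asks only for IDs above the last one seen. This assumes idannonce is a unique, non-negative numeric field; `search` is a hypothetical callable standing in for whatever client sends the request body and returns (id, ts) pairs.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Hit = Tuple[int, int]  # hypothetical shape: (idannonce, ts)


def range_page_body(last_id: int, size: int = 10000) -> Dict:
    """Build a request body for the next chunk of IDs above last_id."""
    return {
        "from": 0,  # never shifted: the range does the paging
        "size": size,
        "fields": ["idannonce", "ts"],
        "sort": ["idannonce"],
        "query": {"range": {"idannonce": {"gt": last_id}}},
    }


def iterate_sorted(search: Callable[[Dict], List[Hit]], size: int = 10000,
                   start_after: int = -1) -> Iterator[Hit]:
    """Yield all hits in ID order, one range-limited chunk at a time."""
    last_id = start_after
    while True:
        hits = search(range_page_body(last_id, size))
        if not hits:
            return
        yield from hits
        last_id = hits[-1][0]  # resume just after the last ID of this chunk


# In-memory stand-in for the index, just to exercise the loop.
data = [(i, i * 10) for i in range(1, 26)]

def fake_search(body: Dict) -> List[Hit]:
    gt = body["query"]["range"]["idannonce"]["gt"]
    return [d for d in data if d[0] > gt][: body["size"]]

collected = list(iterate_sorted(fake_search, size=10))
print(len(collected))
```

With this shape every request only needs the top `size` entries, so the per-chunk cost should no longer grow with how far into the index you are.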

clint

It makes sense.
I thought Elasticsearch could optimise the query and directly identify the
docs it needed (it's a basic match_all query without any search criteria),
so that it could simply return the 10000 docs with idannonce above 1000000.
In my case I have a single shard, but Elasticsearch doesn't have to take
this into account.

It looks like I was wrong; I'm going to try with a range query.
Thanks

but because it's a simple match_all, there is no real search criteria in it.
I had the feeling that it could be possible to take the 10000 docs after
the

2012/6/22 Clinton Gormley clint@traveljury.com
