Random scan results?


(Josh Harrison) #1

I need to be able to pull hundreds of thousands to millions of random
documents from my indexes. Normally, to pull this much data I'd use a scan
query, but scan queries don't support sorting, so the suggestions I've seen
online for randomizing results don't work (such as those discussed here:
http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch).
Is there a way to introduce randomness into a basic scan query?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b3971dda-2963-48ce-b7ed-f50e85b82a97%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

Hi Josh,

In order to run efficiently, scan queries read records sequentially on disk
and keep a cursor that is used to maintain state between successive pages.
Returning records in a random order would not be possible, because the reads
could no longer be sequential.



(Josh Harrison) #3

Darn, OK. Thank you.
If I'm retrieving large numbers of random, largish documents (Twitter river
records), is there a particular pattern I should use for searching? That
is, does it make sense to send 20 sequential queries with size 10,000 and
random sorting, or a single query with a size of 200,000? What about up
into the millions? Obviously we risk duplicate results when sending
multiple smaller queries, but that is OK for our purposes, or can be dealt
with at another stage of the process outside ES.
Thanks,
Josh



(Adrien Grand) #4

The issue with this workload is that it is dominated by random I/O, so I'm
afraid it might behave badly once your index grows larger than your
filesystem cache. (This issue is not specific to Elasticsearch; any
data store would suffer from it when fetching large numbers of random
records.)

That said, if your index is small and/or you have lots of RAM for your
filesystem cache, this might still work well enough.

Regarding your question about sizes, Elasticsearch roughly does 1 or 2
random seeks per search term (in the inverted index) and 1 per returned
document. Since your sizes are large, running 20 queries with size=10K or 1
with size=200K doesn't change much wrt disk seeks as they are dominated by
the seeks to return documents.
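
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the choice of one search term per query are assumptions made for illustration only; the "2 seeks per term, 1 per returned document" figures are the rough estimates given above, not measured values.

```python
def estimated_seeks(num_queries, size, terms_per_query=1, seeks_per_term=2):
    """Rough disk-seek estimate: a few seeks per search term, one per hit.

    terms_per_query=1 is an illustrative assumption; seeks_per_term=2 is the
    upper end of the "1 or 2" estimate from the discussion.
    """
    term_seeks = num_queries * terms_per_query * seeks_per_term
    doc_seeks = num_queries * size  # one seek per returned document
    return term_seeks + doc_seeks


many_small = estimated_seeks(num_queries=20, size=10_000)
one_large = estimated_seeks(num_queries=1, size=200_000)

print(many_small)  # 20 * 2 + 20 * 10_000 = 200040
print(one_large)   # 1 * 2 + 1 * 200_000 = 200002
```

Under these assumptions the two approaches differ by a few dozen seeks out of roughly 200,000, which is why the split barely matters for disk access.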

However, memory-wise, Elasticsearch is going to be much happier if you run
more search requests with smaller sizes, so I would recommend running 20
queries with a size of 10K (or maybe even 200 with size=1K).
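
As a minimal sketch of that batching pattern (not a definitive implementation), the Python below builds one request body per batch and removes the duplicates that independent random batches can produce. The `function_score` / `random_score` body is one common way to get random ordering, but its exact shape depends on the Elasticsearch version, so treat it as an assumption. `random_batch_bodies`, `collect_unique` and the `search` callable are hypothetical names introduced here; `search` stands in for whatever client call sends a body to the cluster and returns the list of hits.

```python
import random


def random_batch_bodies(total, batch_size, rng):
    """Build one search body per batch, each with its own random seed."""
    bodies = []
    remaining = total
    while remaining > 0:
        size = min(batch_size, remaining)
        bodies.append({
            "size": size,
            "query": {
                # Assumed request shape; check it against your ES version.
                "function_score": {
                    "query": {"match_all": {}},
                    "random_score": {"seed": rng.randrange(2**31)},
                }
            },
        })
        remaining -= size
    return bodies


def collect_unique(bodies, search):
    """Run each body through `search` and keep the first hit seen per _id."""
    seen = {}
    for body in bodies:
        for hit in search(body):
            seen.setdefault(hit["_id"], hit)
    return list(seen.values())


# 200,000 documents requested as 20 batches of 10,000.
bodies = random_batch_bodies(200_000, 10_000, random.Random(0))
print(len(bodies))  # 20
```

Because each batch is drawn independently, the number of unique documents returned by `collect_unique` can be smaller than the total requested; request extra batches if an exact count matters.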


