Scrolling performance


(Josh Holtzman) #1

I'm trying to implement an index exporter by running a "match_all" query
and scrolling through the entire index, 100 or 1000 documents at a
time. Scrolling slows down significantly as I go. The
first scroll request via the REST API returns in < 50ms, but once I've
scrolled through 1.5 million of the 2 million total docs, each request
takes > 1 second. I've set the scroll timeout to 10 seconds,
which performs better than 10 minutes, but I can't decrease the
timeout much more without risking timing out between calls.

I'm wondering if a) this dramatic slowdown is expected, and b) if
there's a better way to scroll through all documents quickly.
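
For anyone who wants to reproduce the measurement, here is a minimal sketch of timing each scroll call. It assumes a hypothetical fetch_page() callable (not part of any Elasticsearch client) that performs one scroll request and returns that batch of documents, or an empty list once the index is exhausted.

```python
import time


def time_pages(fetch_page, clock=time.perf_counter):
    """Call fetch_page() until it returns an empty batch.

    fetch_page is a hypothetical callable standing in for one scroll
    request. Returns a list of (batch_size, seconds) tuples, one per
    non-empty batch, so a slowdown over time is easy to spot.
    """
    timings = []
    while True:
        start = clock()
        batch = fetch_page()
        elapsed = clock() - start
        if not batch:
            break  # an empty batch marks the end of the scroll
        timings.append((len(batch), elapsed))
    return timings
```

Comparing the first and last entries of the returned list shows whether the per-request time grows as the scroll advances.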

Thanks,
Josh


(Josh Holtzman) #2

Sorry, I missed this thread, which covers the same topic:
http://groups.google.com/group/elasticsearch/browse_thread/thread/667811af4a1f5fae


(Shay Banon) #3
  • Setting a lower scroll timeout will not affect the performance of
    scrolling.
  • Strange that it ends up taking 1 second at the 1.5M mark on match_all.
    How many shards do you have in the index? How many nodes in the cluster?


(Josh Holtzman) #4

4 shards, all on a single node.

Thanks,
Josh


(Josh Holtzman) #5

The documentation at http://www.elasticsearch.org/guide/reference/api/search/scroll.html
describes scrolling through a large set of data using this URL as an
example: http://localhost:9200/twitter/tweet/_search?scroll=5m

Unfortunately, that example omits the key parameter: search_type=scan.
Following the documentation at http://www.elasticsearch.org/guide/reference/api/search/search-type.html
did the trick, and scrolling performance is now constant across
requests.
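
For reference, a rough sketch of what the scan-based export loop might look like. It assumes the REST endpoints as documented at the time of this thread (an initial _search with search_type=scan that returns only a _scroll_id, followed by repeated _search/scroll calls until no hits come back). The function names scan_all and http_get_json are made up for illustration, and the transport is passed in so it can be swapped for any HTTP client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def http_get_json(url, body=None):
    """Illustrative default transport: GET with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = Request(url, data=data, method="GET",
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scan_all(base_url, index, get_json=http_get_json, scroll="10s", size=100):
    """Yield every hit of a match_all query using search_type=scan.

    Assumes the first response carries a _scroll_id but no hits, and that
    each later response carries hits until the scroll is exhausted.
    """
    params = urlencode({"search_type": "scan", "scroll": scroll, "size": size})
    start_url = "%s/%s/_search?%s" % (base_url, index, params)
    page = get_json(start_url, {"query": {"match_all": {}}})
    scroll_id = page["_scroll_id"]

    while True:
        params = urlencode({"scroll": scroll, "scroll_id": scroll_id})
        page = get_json("%s/_search/scroll?%s" % (base_url, params))
        hits = page["hits"]["hits"]
        if not hits:
            break  # no more documents
        for hit in hits:
            yield hit
        # Use the scroll id from the latest response for the next call.
        scroll_id = page.get("_scroll_id", scroll_id)
```

With the URL from the docs this would be called as scan_all("http://localhost:9200", "twitter/tweet"). Keep in mind that this mirrors the API of this thread's era; later Elasticsearch releases dropped the scan search type, so check the documentation for the version you run.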

Thanks,
Josh

