Scan/Scroll performance degrading the further into the results I get

Hi, I'm using ES 1.7.3 and am attempting to re-index using scan/scroll. I'm noticing that scan is very fast at the beginning, but performance steadily degrades the further into the results I get. For example, the first batch of 100k docs takes 7s to query and iterate over, but by the 15 millionth doc it's taking 10 minutes to get through 100k docs. Is this expected? From everything I've read, using scan should avoid exactly this kind of slowdown, but it doesn't appear to be having any effect.

I am using elasticsearch-py's reindex() helper, so I initially filed a bug there, but I'm posting here because it's looking more and more like a core ES issue with scan rather than something related to the Python client. I have many more details (graphs, benchmarks, hot_threads) posted in that bug: https://github.com/elastic/elasticsearch-py/issues/397
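
For reference, this is roughly how the helper gets invoked; the host and index names below are placeholders rather than our actual setup:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])  # placeholder host

# helpers.reindex() scans the source index with scan/scroll and
# bulk-indexes the documents into the target index.
helpers.reindex(
    client=es,
    source_index="old_index",    # placeholder name
    target_index="new_index",    # placeholder name
    chunk_size=500,              # docs per bulk request
    scroll="5m",                 # lifetime of each scroll context
)
```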

Getting over this hurdle is essential for our upgrade to ES 2.3. We're basically blocked from reindexing a relatively modest-sized index due to this, so any help would be appreciated. Thx!

Bumping this... we're currently blocked from upgrading to 2.X. We're stuck on 1.7 until we can update our mappings and reindex. Basic question: is scan/scroll expected to slow down the deeper into the results you get?

No.

Without being able to look into the source code, I can't comment.

Sorry, I just saw the es-py issue; it boils down to the internals of the reindex() function, which is hard for me to answer. It seems you would have to implement your own scan/scroll calls to get finer control over the query/filter and to ensure there is no scoring or sorting in the way. I'm not too familiar with Python and don't know the internals of the Python client.
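
As a rough, untested sketch of what I mean (index name, batch size, and scroll timeout are just example values), driving scan/scroll directly in Python would look something like this, so you can see exactly which requests hit the cluster:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # example host

# search_type='scan' (ES 1.x) disables scoring and sorting; the initial
# request returns only a scroll_id, no hits.
resp = es.search(
    index="old_index",                       # example index name
    search_type="scan",
    scroll="5m",
    size=500,                                # per-shard batch size
    body={"query": {"match_all": {}}},
)
scroll_id = resp["_scroll_id"]

while True:
    page = es.scroll(scroll_id=scroll_id, scroll="5m")
    hits = page["hits"]["hits"]
    if not hits:
        break
    scroll_id = page["_scroll_id"]           # always pass the latest scroll_id
    # feed the hits to your bulk indexer here
```

If that loop stays fast all the way through, the slowdown is likely in the helper; if it degrades the same way, it points back at the cluster.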

If you like comparisons, you could try running a knapsack plugin export on your data. It uses the Java client methods: https://github.com/jprante/elasticsearch-knapsack