Scan/Scroll performance degrading the further into the results I get

Hi, I'm using ES 1.7.3 and am attempting to re-index using scan/scroll. I'm noticing that scan is very fast at the beginning, but performance steadily degrades the further into the results I get. For example, the first batch of 100k docs takes 7s to query and iterate over, but by the 15 millionth doc it's taking 10 minutes to get through 100k docs. Is this expected? From everything I've read, using scan should avoid exactly this kind of slowdown, but it doesn't appear to be having any effect.

I am using elasticsearch-py's reindex() helper, so I initially filed a bug there, but I'm posting here because it's looking more and more like a core ES issue with scan rather than something related to the Python client. I have many more details (graphs, benchmarks, hot_threads) posted in that bug: https://github.com/elastic/elasticsearch-py/issues/397
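
For reference, this is roughly how the helper gets invoked; the host and index names below are placeholders rather than our actual setup:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])  # placeholder host

# helpers.reindex() scans the source index with scan/scroll and
# bulk-indexes the documents into the target index.
helpers.reindex(
    client=es,
    source_index="old_index",    # placeholder name
    target_index="new_index",    # placeholder name
    chunk_size=500,              # docs per bulk request
    scroll="5m",                 # lifetime of each scroll context
)
```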

Getting over this hurdle is essential for our upgrade to ES 2.3. We're basically blocked from reindexing a relatively modest-sized index due to this, so any help would be appreciated. Thx!

Bumping this... we're currently blocked from upgrading to 2.X. We're stuck on 1.7 until we can update our mappings and reindex. Basic question: is scan/scroll expected to slow down the deeper into the results you get?

No.

Without being able to look into the source code, I can't comment.

Sorry, I just saw the es-py issue; it boils down to the internals of the reindex() function, which is hard for me to answer. It seems you would have to implement your own scan/scroll calls to get finer control over the query/filter and to ensure there is no scoring or sorting in the way. I'm not too familiar with Python and don't know the internals of the Python client.
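
As a rough, untested sketch of what I mean (index name, batch size, and scroll timeout are just example values), driving scan/scroll directly in Python would look something like this, so you can see exactly which requests hit the cluster:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # example host

# search_type='scan' (ES 1.x) disables scoring and sorting; the initial
# request returns only a scroll_id, no hits.
resp = es.search(
    index="old_index",                       # example index name
    search_type="scan",
    scroll="5m",
    size=500,                                # per-shard batch size
    body={"query": {"match_all": {}}},
)
scroll_id = resp["_scroll_id"]

while True:
    page = es.scroll(scroll_id=scroll_id, scroll="5m")
    hits = page["hits"]["hits"]
    if not hits:
        break
    scroll_id = page["_scroll_id"]           # always pass the latest scroll_id
    # feed the hits to your bulk indexer here
```

If that loop stays fast all the way through, the slowdown is likely in the helper; if it degrades the same way, it points back at the cluster.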

If you like comparisons, you could try running a knapsack plugin export on your data. It uses the Java client methods: https://github.com/jprante/elasticsearch-knapsack