I am using a Python script to pull data from Elasticsearch and process it. There are around 5000k records, but it has been running for more than 24 hours and still hasn't finished. What can I do to make it faster?
Since you're just trying to pull out data with scrolls, I would move those clauses into the filter portion of the boolean query. This removes the scoring step and should help speed it up. Right now the query scores every boolean component, which is unnecessary because you don't actually care about the score.
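As a rough sketch of what that change looks like, the helper below takes a bool query written as a Python dict and moves its `must` clauses into `filter`. The helper name `move_must_to_filter` and the field names `status` and `timestamp` are made up for illustration, since the original query isn't shown here.

```python
import copy


def move_must_to_filter(query):
    """Return a copy of a bool query with its `must` clauses moved into `filter`.

    Clauses under `filter` still restrict which documents match,
    but they do not contribute to the relevance score.
    """
    result = copy.deepcopy(query)
    bool_part = result.get("bool", {})

    # Both `must` and `filter` may be a single clause or a list of clauses.
    must = bool_part.pop("must", [])
    if isinstance(must, dict):
        must = [must]
    existing = bool_part.get("filter", [])
    if isinstance(existing, dict):
        existing = [existing]

    if must or existing:
        bool_part["filter"] = existing + must
    return result


# Hypothetical query: the field names are placeholders, not from the original post.
scored_query = {
    "bool": {
        "must": [
            {"term": {"status": "active"}},
            {"range": {"timestamp": {"gte": "now-7d"}}},
        ]
    }
}

filtered_query = move_must_to_filter(scored_query)
```

The resulting `filtered_query` matches the same documents, so it can be passed to the scroll request in place of the scored version.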
I would also check that there aren't any exceptions in the server log or the client log. I've definitely run into slow batches before and realized afterwards that I had an error in my code: it was spewing exceptions for hours without making any progress. This can happen if you're not using the scroll ID correctly, for example.
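For comparison with your own loop, here is a minimal sketch of a scroll loop that handles the scroll ID the way I'd expect. It assumes a client object with `search`, `scroll` and `clear_scroll` methods in the style of the elasticsearch-py 7.x client; the function name `scroll_all`, the page size and the keep-alive value are placeholders, so adjust them to your setup.

```python
def scroll_all(client, index, query, page_size=1000, keep_alive="2m"):
    """Yield every hit for `query`, one scroll page at a time.

    Assumes `client` behaves like an elasticsearch-py 7.x client.
    """
    response = client.search(
        index=index,
        body={"query": query},
        scroll=keep_alive,
        size=page_size,
    )
    scroll_id = response["_scroll_id"]
    try:
        # An empty page means the scroll is exhausted.
        while response["hits"]["hits"]:
            for hit in response["hits"]["hits"]:
                yield hit
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            # Always continue with the ID from the latest response;
            # reusing a stale ID is an easy way to stall or raise errors.
            scroll_id = response["_scroll_id"]
    finally:
        # Release the server-side scroll context when done or on error.
        client.clear_scroll(scroll_id=scroll_id)
```

If you are on elasticsearch-py, `elasticsearch.helpers.scan` wraps the same pattern, so it may be simpler to use that than to maintain the loop by hand.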