I want to export my documents using the Scroll API or a search_after request, but I can't decide which one I should use. First of all, I have 2 different indices.
The first one is constant, meaning the number of documents in the index is fixed. I have 2 million documents and they never change (no updates or deletes).
My other index also has millions of records; the exact number varies, around 4-6 million, and it can grow. Its documents are updated continuously.
My question is this: I want to export my documents, but at some time interval, say 20k documents per minute. I can do that with the Scroll API, that's fine. However, if my system goes down, I must not lose the scroll ID, so that I can resume the export where it stopped. Or a user might want to export only 100 documents per minute, and if I have 10 million records it takes a long time to export everything. Is that a bad operation and a heavy load for Elasticsearch? Because the scroll takes a snapshot of my index and my documents are updated continuously, the old document versions are kept around for a long time. (Of course I close the scroll ID after my job is finished.)
How long should I keep the search context alive? Can I open a scroll with, say, scroll=6h, or is it bad practice to keep the context open for 6 hours? I think the default limit is 24h; if that is the case, it should not be a problem, right?
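To make the scroll part concrete, here is roughly what my export loop looks like with a recent version of the official Python client (the endpoint, index name, page size and keep-alive are just placeholders). As far as I understand, each scroll call renews the keep-alive, so it only needs to outlive the processing of one page, not the whole job:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Open a scroll with a short keep-alive; it is renewed on every scroll call,
# so it only has to cover the processing time of a single page.
resp = es.search(
    index="my-index",          # placeholder index name
    scroll="5m",               # keep-alive per page
    size=1000,                 # page size
    query={"match_all": {}},
)
scroll_id = resp["_scroll_id"]

try:
    while resp["hits"]["hits"]:
        for hit in resp["hits"]["hits"]:
            ...                # export the document, e.g. write it to a file
        resp = es.scroll(scroll_id=scroll_id, scroll="5m")
        scroll_id = resp["_scroll_id"]
finally:
    # Release the search context as soon as the export finishes (or fails).
    es.clear_scroll(scroll_id=scroll_id)
```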
Or should I use search_after, export my documents in batches, and, if my system goes down, resume where it stopped, since I can tell where it stopped from the sort value or tiebreaker ID?
This sounds like an ideal task for a message queue, which Elasticsearch is not. If you need consumers to access all documents being indexed, insert them into a message queue as well and read from there. If you need to export based on a query, write the results to a message queue and let the client pull from it at any desired pace.
I am only able to use Elasticsearch for now, so I have to find a solution within the limited time. My problem is consuming those messages per minute, i.e. I should export X records in 1 minute. X can vary and of course has an upper limit. The Scroll API and search_after are both fine for me, but my concern is different: does it create load when I keep a scroll alive for a long time? And what happens if I use search_after instead? Thanks!
Thank you for your idea. I understand that using a message queue would be a better solution, but I can't do it right now. Obviously it has a cost, and keeping a search context open is not best practice for 2-3 million records, so I would prefer the search_after API. Are there any considerations, ideas or problems that come to mind for the search_after API, since that is the option I can use? (Of course my sort field would be unique, like a createTime on the documents, etc.) Many thanks!
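For reference, this is roughly how I plan to page with search_after and persist a checkpoint so the export can resume after a crash. The endpoint, index name, checkpoint file, and the createTime plus "id" sort fields are my own assumptions for the sketch, not a fixed design:

```python
import json
from pathlib import Path
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint
CHECKPOINT = Path("export_checkpoint.json")   # where the last sort values are saved

# Resume from the last saved sort values, if a checkpoint exists.
search_after = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else None

while True:
    kwargs = {}
    if search_after is not None:
        kwargs["search_after"] = search_after
    resp = es.search(
        index="my-index",                              # placeholder index name
        size=1000,                                     # batch size per request
        query={"match_all": {}},
        # Unique, stable sort: createTime plus an "id" keyword field copied
        # into the document as the tiebreaker (assumed to exist).
        sort=[{"createTime": "asc"}, {"id": "asc"}],
        **kwargs,
    )
    hits = resp["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        ...                                            # export the document
    # Persist the sort values of the last hit; a restart picks up from here.
    search_after = hits[-1]["sort"]
    CHECKPOINT.write_text(json.dumps(search_after))
```

Unlike a scroll, there is no server-side context to keep alive here; the checkpoint file is the only state, so a 1-minute (or slower) batch cadence should not matter.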