Optimised Keep Alive Time for Scroll API


I want to export my documents using the Scroll API or a search_after request, but I can't decide which one I should use. First of all, I have two different indices.

The first one is constant, meaning its document count does not change. It holds 2 million documents, and they are never updated or deleted. It is fixed.

My other index also has millions of documents, somewhere between 4 and 6 million, and it can grow. Its documents are updated continuously.

My question is this: I want to export my documents, but at a limited rate, say 20k documents per minute. I can do that with the Scroll API, that part is fine. However, if my system goes down I must not lose my scroll_id, so that I can resume the export where it stopped. Also, if a user wants to export only 100 documents per minute and I have 10 million records, exporting everything takes a long time. Is that a bad operation and a heavy load for Elasticsearch? Because a scroll takes a snapshot of my index, and my documents are updated continuously, the old versions of documents have to be kept around for a long time. (Of course, I will clear the scroll_id after my job is finished.)

How long should I keep the search context alive? Can I open a scroll with, say, a 6-hour keep-alive, or is it bad practice to hold a search context open for 6 hours? If I remember correctly, the cluster's maximum allowed keep-alive (`search.max_keep_alive`) defaults to 24h, so 6 hours would be within the limit, but is it a problem in practice?
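One thing worth noting about the keep-alive: it is refreshed on every scroll request, so it only has to cover the gap between two consecutive page fetches, not the total export time. Below is a minimal sketch of the request bodies involved, written as plain Python dicts; the index name, page size, and keep-alive values are placeholders, not recommendations.

```python
# Sketch of the Scroll API request flow as plain request bodies.
# "my-index", the page size, and the keep-alive are placeholders.

# 1) Open the scroll: POST /my-index/_search?scroll=10m
initial_request = {
    "size": 1000,                  # documents per page
    "query": {"match_all": {}},
}

# 2) Fetch the next page: POST /_search/scroll
#    The keep-alive is renewed on EVERY scroll call, so it only needs
#    to outlive the pause between two pages (e.g. the one-minute
#    interval), not the whole multi-hour export.
next_page_request = {
    "scroll": "10m",               # extends the context by 10 more minutes
    "scroll_id": "<scroll_id from the previous response>",
}

# 3) Release the search context as soon as the export finishes:
#    DELETE /_search/scroll
clear_request = {"scroll_id": ["<scroll_id>"]}
```

So instead of a single 6-hour keep-alive, a short value renewed on each page keeps the context from lingering if the exporter dies between pages.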

Or should I use search_after and export my documents in batches, resuming where I stopped if my system goes down, since I can recover the position from the last sort value or a tie_breaker_id?
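The resume-after-crash idea can be sketched without a cluster: the only state you need to persist is the sort values of the last exported hit. Here is a pure-Python simulation of that cursor logic, with documents modeled as (create_time, tie_breaker_id) pairs; in a real search_after request the cursor would be the `sort` array returned with the last hit.

```python
# Pure-Python simulation of resumable search_after-style paging.
# A document's sort key is the tuple (create_time, id); the cursor is
# the sort key of the last exported document.

def fetch_page(docs, cursor, size):
    """Return the next `size` docs whose sort key is strictly after `cursor`."""
    remaining = [
        d for d in docs
        if cursor is None or (d["create_time"], d["id"]) > cursor
    ]
    page = sorted(remaining, key=lambda d: (d["create_time"], d["id"]))[:size]
    new_cursor = (page[-1]["create_time"], page[-1]["id"]) if page else cursor
    return page, new_cursor

docs = [{"create_time": t, "id": f"doc-{t}"} for t in range(10)]

# Export three docs, then "crash"; only the cursor was persisted.
page1, cursor = fetch_page(docs, None, 3)
saved_cursor = cursor                # e.g. written to disk or a DB

# After a restart, resume exactly where the export stopped.
page2, cursor = fetch_page(docs, saved_cursor, 3)
```

Because the cursor is just a pair of values rather than server-side state, nothing is lost if the exporter restarts, which is the key difference from a scroll_id.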

Any ideas would help. Thanks!

This sounds like an ideal task for a message queue, which Elasticsearch is not. If you need consumers to access all documents being indexed, insert them into a message queue as well and read from there. If you need to export based on a query, write the results to a message queue and let the client pull from it at any desired pace.

I am only able to use Elasticsearch for now, so I have to find a solution within it given my limited time :frowning: . My problem is consuming those documents at a per-minute rate: I need to export x records per minute, where x varies but has an upper limit. The Scroll API and search_after are both fine for me, but my concern is different: does keeping a scroll alive for a long time create load on the cluster? And what happens if I use search_after instead? Thanks!

Setting up a message queue would take a lot less time than trying to do this reliably via Elasticsearch. I also suspect it would be a better solution.

Keeping scrolls open requires the segments the scroll is based on to be left in place, which has an impact on merging, indexing and disk space usage.


Thank you for your idea. I understand that a message queue would be the better solution, but I can't adopt one right now. It has a cost, and since keeping a scroll context open is not best practice for 2-3 million records, I would prefer the Search After API. Are there any considerations, caveats, or problems that come to mind with the Search After API, since that option is available to me? (Of course, my sort key would be unique, e.g. a createTime per document plus a tiebreaker.) Many thanks!
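For the unique-sort point above, a common pattern is to sort on the timestamp field plus a dedicated tiebreaker field, and feed the last hit's `sort` values back via `search_after`. Below is a hedged sketch of the request bodies as Python dicts; the field names `createTime` and `tieBreakerId` (a keyword field copied from the document id) are placeholders, and the example sort values are made up.

```python
# Sketch of search_after request bodies with a unique sort.
# Field names `createTime` / `tieBreakerId` are placeholders.

first_request = {
    "size": 1000,
    "query": {"match_all": {}},
    "sort": [
        {"createTime": "asc"},
        {"tieBreakerId": "asc"},   # makes the sort a total order
    ],
}

# Each hit in the response carries its "sort" values; persist the
# last hit's values and feed them back to resume, even after a crash.
last_hit_sort = [1696233600000, "doc-42"]   # example values only

next_request = dict(first_request, search_after=last_hit_sort)
```

Note that `createTime` alone is not enough if two documents can share a timestamp; the tiebreaker is what guarantees no document is skipped or duplicated across pages.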

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.