I am trying to execute a full extract of our Elasticsearch data, either on intervals or on a monthly basis (depending on which creates less load), push it to a file, and then load it into another system (Hive) for analytics. I am currently trying to build a solution for the extraction step.
The index size is relatively large (400+ million records), and most solutions involving the REST API would lead to millions of queries daily.
Given that, what bulk extraction mechanisms would you recommend?
Tentatively, I looked at a couple of solutions but did not reach any conclusions:
A custom solution that queries the index on intervals, adds an indicator (like extracted = true), and repeats until there are no more results in the index. We would leverage the REST API using PIT or a similar pagination mechanism (Paginate search results | Elasticsearch Guide [8.12] | Elastic); a sketch of this approach follows.
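For illustration, here is a minimal sketch of the PIT + search_after loop using the official Python client (elasticsearch-py 8.x). The endpoint, index name, page size, and output path are assumptions, and the "extracted = true" update is omitted for brevity:

```python
# Minimal sketch: page through an index with a point in time (PIT) and
# search_after, writing each page to an NDJSON file. Endpoint, index name,
# page size, and output path are placeholders -- adjust for your cluster.
import json

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

# A PIT gives the pagination a consistent view of the index while it runs.
pit_id = es.open_point_in_time(index="my-index", keep_alive="5m")["id"]

search_after = None
with open("extract.ndjson", "w") as out:
    while True:
        kwargs = {
            "size": 5000,
            "pit": {"id": pit_id, "keep_alive": "5m"},
            "sort": [{"_shard_doc": "asc"}],  # cheap tiebreaker sort for PIT
        }
        if search_after is not None:
            kwargs["search_after"] = search_after
        resp = es.search(**kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            out.write(json.dumps(hit["_source"]) + "\n")

        search_after = hits[-1]["sort"]      # resume after the last doc
        pit_id = resp.get("pit_id", pit_id)  # the PIT id can change

es.close_point_in_time(id=pit_id)
```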
What are you using to send data to your cluster? Depending on the sender you might have an option to set up both an Elasticsearch output and a file output, and then instead of a monthly/periodic pull from Elasticsearch you'll already have your data in a separate file. That wouldn't allow you to collect anything currently in the cluster, but "going forward" it might be an option.
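If the sender is a custom application (rather than, say, Logstash, which can declare both an elasticsearch and a file output in its pipeline config), the dual-output idea might look roughly like this in Python; the endpoint, index name, archive path, and batching are assumptions:

```python
# Minimal sketch of a "dual output" sender: each batch is appended to a
# local NDJSON archive (for the Hive pipeline) and bulk-indexed into
# Elasticsearch. All names below are placeholders.
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

def send_batch(docs, index="my-index", archive_path="archive.ndjson"):
    # File output: keep a copy of every document for later analytics loads.
    with open(archive_path, "a") as archive:
        for doc in docs:
            archive.write(json.dumps(doc) + "\n")
    # Elasticsearch output: index the same documents in one bulk request.
    helpers.bulk(es, ({"_index": index, "_source": doc} for doc in docs))
```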
I don't think the output of a snapshot is going to be usable by anything except Elasticsearch.
Sadly, that is not an option because it does not serve the goal we are trying to accomplish.
Ultimately, the goal is to ensure that records are inserted into Elasticsearch and that no error or unexpected ingestion problem happens. As far as that goes, it seems that reading the data directly from Elasticsearch is the only safe way to verify it.
Aaaah, in that case I think you need to see if there's any possibility of Elasticsearch or other logs recording the errors you're concerned about, which might avoid the need to export everything for verification. Otherwise, do your bulk extraction either in small bites or at a low-utilization time for your cluster (assuming you have a regularly occurring low-utilization time - I know some use cases don't).
450 million docs a month works out to around 10,400 per minute (450,000,000 / (30 × 24 × 60) ≈ 10,417), so your first solution would, I think, stand a chance of being a low-intrusiveness solution, certainly relative to an "all of the docs at once" pull. I think you're right to be thinking of adding a "has been read" indicator. Elasticsearch-Dump also looks interesting. Depending on how your indices are set up - a few really large ones vs. a few dozen medium-size ones vs. hundreds of small ones - it might be worth giving Elasticsearch-Dump a try on a mid-size index to see how long it takes and what the impact on overall cluster performance is; an example invocation follows.
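For reference, a trial run of Elasticsearch-Dump (the elasticdump npm package) on a single index might look like the following; the endpoint, index name, and --limit batch size are placeholders, not tested settings:

```
# Dump the documents of one mid-size index to a local file;
# --limit controls how many docs are fetched per request.
elasticdump \
  --input=http://localhost:9200/my-mid-size-index \
  --output=my-mid-size-index.json \
  --type=data \
  --limit=5000
```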