I am trying to execute a full extract of our Elasticsearch data, either on intervals or on a monthly basis (depending on which creates less load), push it to a file, and then load it into another system (Hive) for analytics. I am currently trying to build a solution for the extraction step.
The index size is relatively large (400+ million records), and most solutions involving the REST API would lead to millions of queries daily.
Given that, what bulk extraction mechanisms would you recommend?
Tentatively, I looked at a couple of solutions but did not reach any conclusions:
A custom solution that queries the index on intervals, adds an indicator (like extracted = true), and repeats until there are no more results in the index. We would leverage the REST API using PIT or a similar pagination mechanism (Paginate search results | Elasticsearch Guide [8.12] | Elastic); a sketch of this approach follows.
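For illustration, here is a minimal sketch of the PIT + search_after loop using the official Python client (elasticsearch-py 8.x). The endpoint, index name, page size, and output path are assumptions, and the "extracted = true" update is omitted for brevity:

```python
# Minimal sketch: page through an index with a point in time (PIT) and
# search_after, writing each page to an NDJSON file. Endpoint, index name,
# page size, and output path are placeholders -- adjust for your cluster.
import json

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

# A PIT gives the pagination a consistent view of the index while it runs.
pit_id = es.open_point_in_time(index="my-index", keep_alive="5m")["id"]

search_after = None
with open("extract.ndjson", "w") as out:
    while True:
        kwargs = {
            "size": 5000,
            "pit": {"id": pit_id, "keep_alive": "5m"},
            "sort": [{"_shard_doc": "asc"}],  # cheap tiebreaker sort for PIT
        }
        if search_after is not None:
            kwargs["search_after"] = search_after
        resp = es.search(**kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            out.write(json.dumps(hit["_source"]) + "\n")

        search_after = hits[-1]["sort"]      # resume after the last doc
        pit_id = resp.get("pit_id", pit_id)  # the PIT id can change

es.close_point_in_time(id=pit_id)
```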
What are you using to send data to your cluster? Depending on the sender you might have an option to set up both an Elasticsearch output and a file output, and then instead of a monthly/periodic pull from Elasticsearch you'll already have your data in a separate file. That wouldn't allow you to collect anything currently in the cluster, but "going forward" it might be an option.
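If the sender is a custom application (rather than, say, Logstash, which can declare both an elasticsearch and a file output in its pipeline config), the dual-output idea might look roughly like this in Python; the endpoint, index name, archive path, and batching are assumptions:

```python
# Minimal sketch of a "dual output" sender: each batch is appended to a
# local NDJSON archive (for the Hive pipeline) and bulk-indexed into
# Elasticsearch. All names below are placeholders.
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200")  # assumed endpoint

def send_batch(docs, index="my-index", archive_path="archive.ndjson"):
    # File output: keep a copy of every document for later analytics loads.
    with open(archive_path, "a") as archive:
        for doc in docs:
            archive.write(json.dumps(doc) + "\n")
    # Elasticsearch output: index the same documents in one bulk request.
    helpers.bulk(es, ({"_index": index, "_source": doc} for doc in docs))
```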
I don't think the output of a snapshot is going to be usable by anything except Elasticsearch.
Sadly, that is not an option because it does not serve the goal we are trying to accomplish.
Ultimately, the goal is to ensure that records are inserted into Elasticsearch and that no error or unexpected ingestion problem happens. As far as that goes, it seems that reading the data directly from Elasticsearch is the only safe way to verify it.
Aaaah, in that case I think you need to see if there's any possibility of Elasticsearch or other logs recording the errors you're concerned about, which might avoid the need to export everything for verification. Otherwise, do your bulk extraction either in small bites or at a low-utilization time for your cluster (assuming you have a regularly occurring low-utilization time - I know some use cases don't).
450 million docs a month works out to around 10,400 per minute (450,000,000 / (30 × 24 × 60) ≈ 10,417), so your first solution would, I think, stand a chance of being a low-intrusiveness solution, certainly relative to an "all of the docs at once" pull. I think you're right to be thinking of adding a "has been read" indicator. Elasticsearch-Dump also looks interesting. Depending on how your indices are set up - a few really large ones vs. a few dozen medium-size ones vs. hundreds of small ones - it might be worth giving Elasticsearch-Dump a try on a mid-size index to see how long it takes and what the impact on overall cluster performance is; an example invocation follows.
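For reference, a trial run of Elasticsearch-Dump (the elasticdump npm package) on a single index might look like the following; the endpoint, index name, and --limit batch size are placeholders, not tested settings:

```
# Dump the documents of one mid-size index to a local file;
# --limit controls how many docs are fetched per request.
elasticdump \
  --input=http://localhost:9200/my-mid-size-index \
  --output=my-mid-size-index.json \
  --type=data \
  --limit=5000
```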