Saving Results from Millions of Documents of Varying Sizes with the Python Elasticsearch Client

Hi everyone,

I'm looking for feedback on the best approach for querying my Elasticsearch cluster. I have:

  • 6 billion documents in one index
  • documents of varying sizes, of the format:
    {
    "domain" : "website1.com",
    "links" : ["link1.com", "link2.com", "link3.com"]
    }
  • each document can have an arbitrary number of items in the "links" list; my estimate is anywhere from 1 to tens of thousands of items
  • a small search result for a query might return 3 documents, whereas a large one might return 40 million documents

Ideally, I want my script to be able to export millions of document results from Elasticsearch and save them as one or more plain text, CSV, or JSON files. What would be a fast and, ideally, cost-effective way of querying my database and saving the results? Are there any best practices?

Here's what I've come up with so far:

  1. One machine with lots of memory that uses the scan helper function, pulling a maximum of 10-20k documents per request? (See the first sketch after this list.)
  2. Some kind of distributed approach that splits the query across 7 different machines: calculate the total number of matching documents, then divide the work so that each machine fetches its share (say, 1,000 documents at a time) and saves its own results. (See the second sketch after this list.)
  3. Some other option I'm missing?

Any help would be greatly appreciated, thank you in advance!
