Saving Results from Millions of Documents of Varying Sizes with the Python Elasticsearch Client

Hi everyone,

I'm looking for feedback on the best approach for querying my Elasticsearch cluster. I have:

  • 6 billion documents in one index
  • documents of varying sizes, of the format:
    {
    "domain" : "website1.com",
    "links" : ["link1.com", "link2.com", "link3.com"]
    }
  • each document can have an arbitrary number of items in the "links" list; my estimate is anywhere from 1 to tens of thousands of items
  • a small search result for a query might return 3 documents, whereas a large one might return 40 million documents

Ideally, I want my script to be able to export millions of document results from Elasticsearch and save them as one or more plain text, CSV, or JSON files. What would be a fast and, ideally, cost-effective way of querying my database and saving the results? Are there any best practices?

Here's what I've come up with so far:

  1. One machine with lots of memory that uses the scan helper function, pulling a maximum of 10-20k documents per request? (See the first sketch after this list.)
  2. Some kind of distributed approach that splits the query across 7 different machines: calculate the total number of matching documents, then divide the work so that each machine fetches its share (say, 1,000 documents at a time) and saves its own results. (See the second sketch after this list.)
  3. Some other option I'm missing?

Any help would be greatly appreciated, thank you in advance!
