Hi everyone,
I'm looking for feedback on the best approach for querying my Elasticsearch cluster. I have:
- 6 billion documents in one index
- documents of varying sizes, of the format:
{
"domain" : "website1.com",
"links" : ["link1.com", "link2.com", "link3.com"]
}
- each document can have anywhere from 1 to tens of thousands of items in the "links" list
- a small search result for a query might return 3 documents, whereas a large one might return 40 million
Ideally, I want my script to be able to export millions of matching documents from Elasticsearch and save them as one or more plain-text, CSV, or JSON files. What would be a fast and, ideally, cost-effective way of querying the cluster and saving the results? Are there any best practices?
Here's what I've come up with so far:
- One machine with lots of memory that uses the scan helper function, fetching a maximum of 10-20k documents per request? (rough sketch after this list)
- Some kind of distributed approach that splits the query across 7 different machines: calculate the total number of matching documents, then divide the work of fetching them, say 1,000 at a time, across some kind of cluster that saves the results (second sketch below)
- ??? another option which I'm missing
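Here's a rough sketch of option 1 using the Python client's scan helper; the host, index name, and query are just placeholders for my setup:

```python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Placeholder query; _source limits the fields returned to what gets exported.
query = {
    "query": {"term": {"domain": "website1.com"}},
    "_source": ["domain", "links"],
}

with open("results.ndjson", "w") as out:
    # scan() pages through the whole result set via the scroll API,
    # so client-side memory stays flat no matter how many docs match.
    for hit in scan(es, index="my-index", query=query, size=10_000, scroll="5m"):
        out.write(json.dumps(hit["_source"]) + "\n")
```

Since scan streams batches and keeps the scroll context server-side, I'm not sure the "lots of memory" part is even needed; the bottleneck should be network and disk, not RAM.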
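For option 2, Elasticsearch's sliced scroll looks like the native way to do this partitioning instead of computing document ranges by hand: each slice is a disjoint, server-side partition of the result set. A rough sketch (again, host/index/query are placeholders), run here with multiprocessing, but each slice could just as easily go to a separate machine:

```python
import json
from multiprocessing import Pool

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

INDEX = "my-index"   # placeholder index name
SLICES = 7           # one slice per worker (or per machine)

def export_slice(slice_id: int) -> str:
    es = Elasticsearch("http://localhost:9200")  # placeholder host
    body = {
        # Server-side partition: slice `id` of `max` disjoint slices.
        "slice": {"id": slice_id, "max": SLICES},
        "query": {"match_all": {}},  # placeholder query
    }
    path = f"results-{slice_id}.ndjson"
    with open(path, "w") as out:
        for hit in scan(es, index=INDEX, query=body, size=10_000, scroll="5m"):
            out.write(json.dumps(hit["_source"]) + "\n")
    return path

if __name__ == "__main__":
    # To spread this over 7 machines instead, give each machine one slice_id.
    with Pool(SLICES) as pool:
        print(pool.map(export_slice, range(SLICES)))
```

From what I've read, slicing works best when the number of slices is at most the number of shards in the index, so I'd size SLICES accordingly.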
Any help would be greatly appreciated, thank you in advance!