What's the quickest way to extract a LARGE number of records out of ES? Best practices for the scroll API are welcome

For those of you that have had to extract all records to perform an operation on them, what's the best way to go about it? I'm looking to extract records, perform an operation on them, and put the records back into ES. (I already have this setup working using Logstash and RabbitMQ, but it's quite slow.)
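
For reference, here is a minimal sketch of that extract-transform-reindex loop, assuming the official Python client (elasticsearch-py), which wraps the scroll and bulk APIs; "my-index" and the transform itself are hypothetical placeholders:

```python
# Minimal sketch of the extract -> transform -> reindex loop, assuming the
# official Python client (elasticsearch-py). "my-index" and transform()
# are hypothetical placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def transform(source):
    # Hypothetical per-document operation.
    source["processed"] = True
    return source

# helpers.scan wraps the scroll API; size controls how many documents
# come back per batch.
actions = (
    {
        "_op_type": "index",
        "_index": hit["_index"],
        "_type": hit["_type"],
        "_id": hit["_id"],
        "_source": transform(hit["_source"]),
    }
    for hit in helpers.scan(
        es,
        index="my-index",
        query={"query": {"match_all": {}}},
        size=1000,
    )
)

# helpers.bulk batches the writes back into Elasticsearch.
helpers.bulk(es, actions)
```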

I know that the scroll API is preferable (I was going to use it via Logstash). However, what's the best scroll size you have used?

Instead of querying for X, can I use the scroll API on all records of, say, shard 4? That way I could run multiple scrolls against different shards simultaneously.
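
For what it's worth, the search "preference" parameter accepts a "_shards:N" value that restricts a request to shard N, so a per-shard scroll should look something like this sketch (index name and batch size are assumptions):

```python
# Sketch of a per-shard scroll, assuming the Python client. The search
# "preference" parameter accepts "_shards:N", which restricts the request
# to shard N; "my-index" is a hypothetical index name.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def scroll_shard(shard_id):
    # Each call scrolls only the given shard; run one per worker process.
    return helpers.scan(
        es,
        index="my-index",
        query={"query": {"match_all": {}}},
        preference="_shards:%d" % shard_id,
        size=1000,
    )

for hit in scroll_shard(4):
    pass  # operate on the document here
```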

I also thought of partitioning the extraction by query: something like "return all documents matching A*", then have another process return everything that matches "B*", and so on, but I think that would be too much.
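
A sketch of that prefix-partitioning idea, assuming the Python client and a hypothetical not_analyzed "name" field; note the partitions only cover everything if every value starts with a letter:

```python
# Sketch of prefix-partitioned scrolls: one independent scroll per letter,
# each runnable in its own process. "name" is a hypothetical not_analyzed
# field and "my-index" a hypothetical index name.
import string

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def scroll_prefix(prefix):
    query = {"query": {"prefix": {"name": prefix}}}
    return helpers.scan(es, index="my-index", query=query, size=1000)

for letter in string.ascii_lowercase:
    for hit in scroll_prefix(letter):
        pass  # operate on the document here
```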

Here are my specs:

My cluster has 20 data nodes (20 primary shards, 2 replicas), 1 client node, and 1 master node. (I'm adding more masters soon.)

Elasticsearch version: 2.3.3

Marvel version: 2.3.0

JVM version: 1.8.0_91

OS version: Ubuntu 14.04

I have mlockall set to true and ran "sudo swapoff -a" on the entire cluster. I'm allocating 16 GB of RAM to each node (8 GB for ES_HEAP_SIZE).

The most efficient way to run an operation across large numbers of documents is to adapt the operation into an Elasticsearch aggregation. Aggregations can use doc values, a columnar store of the data, so they don't need to read _source out of stored fields; reading _source is slow because it is compressed in chunks, with many documents stored in the same chunk. If possible, you should use an aggregation. If the built-in aggregations can't express your operation, you might be able to implement it as an aggregation plugin. If you want performance across many hits, aggregations are by far the best option.
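
For instance, something like a per-category average that would otherwise require pulling every document can be computed entirely from doc values. A minimal sketch, assuming the Python client and hypothetical "category" and "price" fields:

```python
# "size": 0 returns no hits at all, so no _source is ever decompressed;
# the aggregation runs off the columnar doc values instead.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(
    index="my-index",  # hypothetical index name
    body={
        "size": 0,
        "aggs": {
            "by_category": {
                "terms": {"field": "category"},
                "aggs": {"avg_price": {"avg": {"field": "price"}}},
            }
        },
    },
)

for bucket in resp["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["avg_price"]["value"])
```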

If you can't use aggregations, then your best bet is the scroll API, though a single scroll may still be too slow for you. You can make it faster by slicing it: Elasticsearch 5.0.0-alpha4 will have sliced scrolling, which should be much faster. You can also do the prefix-query style thing you are talking about, or a script query across _id that hashes the ID and takes it modulo the number of workers; that works fairly well. Such a query isn't efficient because it has to visit every document, but query time isn't that big a portion of a scroll. A scroll is usually dominated by _source extraction, which is why aggregations are so much faster: they don't need to perform _source extraction at all.
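
A sketch of what sliced scrolling looks like (it requires 5.0.0-alpha4 or later, so it will not run on 2.3.3); the slice count and index name are assumptions:

```python
# Each slice is an independent scroll over a disjoint subset of the index,
# so the slices can run in parallel, one per worker.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

MAX_SLICES = 4  # hypothetical degree of parallelism

def scroll_slice(slice_id):
    query = {
        "slice": {"id": slice_id, "max": MAX_SLICES},
        "query": {"match_all": {}},
    }
    return helpers.scan(es, index="my-index", query=query, size=1000)

# Run slice_id 0 .. MAX_SLICES-1, one per worker.
for hit in scroll_slice(0):
    pass  # operate on the document here
```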

Depending on your situation, you might want to just declare two of your data nodes master-eligible. Dedicated master nodes don't (usually) need to be big, so if you happen to be running in an environment that supports virtualization, they are a great candidate for it.
