Accelerate search scroll?

Hello, I want to export an Elasticsearch index to HDFS using, let's say, Pig.

It works fine, but it's slow. The index has 3 shards and the cluster has 3 nodes, BUT there is only a single client node (access point).

Before, I was using a small homemade Java program that creates X worker threads per shard, all running search-scroll at the same time. It took 10 minutes to load the entire index.

With Hadoop, on the other hand, I can see there is only a single search-scroll process, and it took 30 minutes :confused:

Is it possible to "force" the number of workers, even with a single client node?

@ebuildy do you see the same sort of bottlenecking when using a framework other than Pig? I suspect that this has to do with Pig's input split combining. By default, Pig tries to combine multiple input splits into one map task. When ES-Hadoop creates the input splits, there's no way for us to accurately estimate the size of the data within a shard, so the input splits we return are of size 1. This can cause Pig to think that many of the splits can easily be read by a single map task. I would try running the job again with the pig.noSplitCombination property set to true, which disables split combining.
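As a rough sketch of what that could look like in a script (the jar path, the index name `my_index/my_type`, the host `client-node:9200` and the output path are all placeholders I made up, so adjust them to your setup):

```pig
-- Disable split combining so each ES-Hadoop input split gets its own map task.
SET pig.noSplitCombination true;

-- Hypothetical jar location; point this at your ES-Hadoop jar.
REGISTER /path/to/elasticsearch-hadoop.jar;

-- Hypothetical index/type and client node address.
docs = LOAD 'my_index/my_type'
       USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=client-node:9200');

-- Hypothetical HDFS output directory.
STORE docs INTO '/data/export/my_index';
```

With combining disabled, the number of map tasks should follow the number of input splits ES-Hadoop returns, so you can check the effect by counting the open scroll contexts on the cluster while the job runs.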

You are right, I see some improvements with

SET pig.noSplitCombination TRUE;
SET default_parallel 5;

And I can see 1 open_contexts per shard! Thank you. I am trying with 3 shards to see if there are 3 open_contexts.

Now I am trying with Hive, and by default I see a single search_contexts :confused:

Here

http://b3.ms/XmWn0WY2AYOk

On the left, Pig => 100 Mbps
On the right, Hive => 30 Mbps

Let's tune it!

No luck making it work any better :confused:

I see there are some changes in v5.0.0 that implement parallel multi-shard support, right?

For now it's really too slow to work with :confused: