Hello, I want to export an Elasticsearch index to HDFS via, let's say, Pig.
It's working fine, but it's slow: the index has 3 shards, and the cluster has 3 nodes but only a single client node (the access point).
Before, I was using a small home-made Java program that creates X workers (multi-threaded), one per shard, all search-scrolling at the same time; it took 10 minutes to load the entire index.
Whereas with Hadoop, I can see there is a single search-scroll process, and it takes 30 minutes.
Is it possible to "force" the number of workers, even with a single client node?
@ebuildy do you see the same sort of bottlenecking when using a framework other than Pig? I suspect this has to do with Pig's input-split combining. By default, Pig tries to combine multiple input splits into one map task. When ES-Hadoop creates the input splits, there's no way for us to accurately estimate the size of the data within a shard, so the input splits we return are of size 1. This can cause Pig to think that many of the splits can be easily read by a single map task. I would try running the job again with the pig.noSplitCombination property set to true, so that each split (one per shard) keeps its own map task.
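A hedged sketch of that change, reusing the assumed index/path names from the script above; only the SET line is the actual suggestion:

```pig
-- Disable Pig's split combining so each ES shard gets its own map task (and its own scroll)
SET pig.noSplitCombination true;

REGISTER elasticsearch-hadoop.jar;

docs = LOAD 'myindex/mytype'
       USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=es-client-node:9200');

STORE docs INTO 'hdfs:///data/myindex_export' USING PigStorage('\t');
```

With combining disabled, a 3-shard index should produce 3 map tasks scrolling in parallel rather than one.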