Accelerate search scroll?

ebuildy · October 14, 2016, 2:39pm

Hello, I want to export an Elasticsearch index to HDFS via, let's say, Pig.

It's working fine, but it's slow. The index has 3 shards and the cluster has 3 nodes, BUT there is a single client node (access point).

Before, I was using a small homemade Java program that creates X workers (multi-threaded) for each shard, all running search-scroll at the same time. It took 10 minutes to load the entire index.

Whereas with Hadoop, I can see there is a single search-scroll process, and it took 30 minutes.
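
For illustration, the worker-per-shard approach described above could look roughly like the following Java sketch. The original program is not shown in this thread, so everything here is an assumption: `ShardScroller` is a hypothetical stand-in for one shard's search-scroll loop (no real Elasticsearch client calls are made), and `inMemory` is a fake scroller used only to make the sketch runnable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardScrollExport {

    /** Hypothetical stand-in for one shard's scroll: returns the next batch, empty when exhausted. */
    public interface ShardScroller {
        List<String> nextBatch();
    }

    /** Runs one worker thread per shard concurrently and returns the total number of documents read. */
    public static long exportAll(List<ShardScroller> shards) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (ShardScroller shard : shards) {
                Callable<Long> worker = () -> {
                    long count = 0;
                    List<String> batch = shard.nextBatch();
                    while (!batch.isEmpty()) {
                        count += batch.size(); // a real exporter would write the batch to HDFS here
                        batch = shard.nextBatch();
                    }
                    return count;
                };
                results.add(pool.submit(worker));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // waits for that shard's worker to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Fake in-memory scroller (assumption, for demonstration only). */
    public static ShardScroller inMemory(int docs, int batchSize) {
        return new ShardScroller() {
            private int remaining = docs;

            @Override
            public List<String> nextBatch() {
                int n = Math.min(batchSize, remaining);
                remaining -= n;
                List<String> batch = new ArrayList<>();
                for (int i = 0; i < n; i++) {
                    batch.add("doc");
                }
                return batch;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        List<ShardScroller> shards = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            shards.add(inMemory(1000, 100)); // 3 shards, as in the index described above
        }
        System.out.println("documents read: " + exportAll(shards));
    }
}
```

Each shard is drained by its own thread, so the scrolls run side by side instead of one after another.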

Is it possible to "force" the number of workers, even with a single client node?

james.baiera · October 14, 2016, 6:21pm

@ebuildy do you see the same sort of bottlenecking when using a framework other than Pig? I suspect that this has to do with Pig's input split combining. By default, Pig tries to combine multiple input splits into one map task. When ES-Hadoop creates the input splits, there's no way for us to accurately estimate the size of the data within a shard, so the input splits we return are of size 1. This can cause Pig to think that many of the splits can be easily read by a single map task. I would try running the job again with the pig.noSplitCombination property set to true.

ebuildy · October 15, 2016, 10:28am

You are right, I see some improvements with:

SET pig.noSplitCombination TRUE;
SET default_parallel 5;

And I can see 1 open_contexts per shard! Thank you. I am trying with 3 shards to see if there are 3 open_contexts.

ebuildy · October 15, 2016, 12:20pm

Now I am trying with Hive, and I see a single search_contexts by default.

Here

http://b3.ms/XmWn0WY2AYOk

On the left, Pig => 100 Mbps
On the right, Hive => 30 Mbps

Let's tune it!

ebuildy · October 17, 2016, 10:11am

No luck making it work any better.

I see there are some changes in v5.0.0 that implement parallel multi-shard reads, right?

For now it's really too slow to work with.
