[hadoop] control of inputsplit number

Damien_Feugas · March 5, 2014, 12:49pm

Hello there.

I'm using the elasticsearch-hadoop jar to distribute an algorithm
over a sharded index.
The processing is fairly simple: it performs a given task on each document
in the index.

Since the algorithm works on one document at a time, it can be distributed
and parallelised.
I therefore put it in the mapper of a MapReduce job. No reducer is
needed.

Currently, the index has 6 shards, and consequently my algorithm runs in
parallel in 6 different mappers.
But the index contains about 3 million documents, so each mapper processes
about 500,000 documents sequentially, which takes 4 hours on average.
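
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It only restates the figures above (3 million documents, 6 splits, 4 hours per mapper). The function names are made up for illustration, and it assumes that documents are spread evenly across splits, that the per-document cost is constant, and that the cluster has enough free map slots to run every split at once. None of these assumptions has been verified.

```python
# Rough estimate only: assumes evenly sized splits, a constant per-document
# cost, and enough map slots to run all splits in parallel.
TOTAL_DOCS = 3_000_000      # approximate index size from the post
CURRENT_SPLITS = 6          # one InputSplit per shard today
HOURS_PER_MAPPER = 4.0      # observed average duration of one mapper


def docs_per_split(total_docs: int, splits: int) -> float:
    """Number of documents each mapper handles if splits are even."""
    return total_docs / splits


def docs_per_second(docs: float, hours: float) -> float:
    """Throughput of a single mapper."""
    return docs / (hours * 3600)


def estimated_hours(total_docs: int, splits: int, rate: float) -> float:
    """Wall-clock hours for one mapper at the given per-mapper rate."""
    return docs_per_split(total_docs, splits) / rate / 3600


# Throughput of one mapper today: 500,000 documents in 4 hours.
rate = docs_per_second(docs_per_split(TOTAL_DOCS, CURRENT_SPLITS),
                       HOURS_PER_MAPPER)

if __name__ == "__main__":
    print(f"per-mapper rate: {rate:.1f} docs/s")
    for splits in (6, 12, 24):
        hours = estimated_hours(TOTAL_DOCS, splits, rate)
        print(f"{splits:>2} splits -> about {hours:.1f} h per mapper")
```

Under those assumptions each mapper handles roughly 35 documents per second, and doubling the number of splits would halve the wall-clock time. In practice the gain may be smaller if Elasticsearch itself becomes the bottleneck.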

I'd like to have more mappers executed in parallel.

I read in the documentation that creating one InputSplit per shard is
done on purpose. I can change the number of shards, but I wonder whether
there is an alternative.
