Should I use elasticsearch-spark-20_2.11-5.2.2.jar instead of elasticsearch-hadoop-hive-5.2.2.jar for loading a Hive table into Elasticsearch?

My Hive cluster has more than 30 nodes, and the table is almost 140 GB. My Elasticsearch cluster (3 data nodes with 8 cores/16 GB memory) is separate from the Hive cluster. Now I want to load data from Hive into Elasticsearch following the Apache Hive integration documentation.

The following is my HiveQL script:

add jar elasticsearch-hadoop-5.2.2.jar;
list jar;
drop table usercenter_dw.performance_hive2es;
CREATE EXTERNAL TABLE `usercenter_dw.performance_hive2es`(
`elastic_0` string COMMENT '', 
`elastic_1` int COMMENT '', 
......
`elastic_78` string COMMENT '', 
`elastic_79` string COMMENT '')
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive2es/artists', 'es.nodes' = '172.21.1.31', 'es.index.auto.create' = 'true', 'es.mapping.id' = 'elastic_0', 'es.batch.size.entries' = '0', 'es.batch.size.bytes' = '4mb');

set mapred.job.queue.name=eng;
set mapred.reduce.tasks=9;
insert overwrite table usercenter_dw.performance_hive2es select * from usercenter_dw.performance_es limit 100000;

When I run the script from the beeline or hive prompt, I get the following output:

INFO  : The url to track the job: http://nn1.bitauto.dmp:8088/proxy/application_1489400630906_62460/
INFO  : Starting Job = job_1489400630906_62460, Tracking URL = http://nn1.bitauto.dmp:8088/proxy/application_1489400630906_62460/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/bin/hadoop job  -kill job_1489400630906_62460
INFO  : Hadoop job information for Stage-0: number of mappers: 631; number of reducers: 1
INFO  : 2017-04-07 09:57:13,237 Stage-0 map = 0%,  reduce = 0%
INFO  : 2017-04-07 09:57:43,177 Stage-0 map = 1%,  reduce = 0%, Cumulative CPU 1230.13 sec
INFO  : 2017-04-07 09:57:47,886 Stage-0 map = 2%,  reduce = 0%, Cumulative CPU 2170.94 sec

From the above output, I can see number of mappers: 631; number of reducers: 1, and the data only starts to be transferred into my Elasticsearch cluster once the stage reaches map = 100%, reduce = 67%, as shown below:

INFO  : 2017-04-07 10:15:54,103 Stage-0 map = 100%,  reduce = 67%, Cumulative CPU 35014.57 sec
INFO  : 2017-04-07 10:16:54,503 Stage-0 map = 100%,  reduce = 67%, Cumulative CPU 35057.39 sec
INFO  : 2017-04-07 10:17:38,336 Stage-0 map = 100%,  reduce = 67%, Cumulative CPU 35089.77 sec
INFO  : 2017-04-07 10:18:38,842 Stage-0 map = 100%,  reduce = 67%, Cumulative CPU 35124.49 sec

I think the single reduce task is what causes such a long loading time (almost one hour). Obviously my set mapred.reduce.tasks=9; does not take effect, and in my experience there is no other way to adjust the number of reduce tasks.
So I want to know whether I should use elasticsearch-spark-20_2.11-5.2.2.jar to build a Spark application instead, where I can set the number of partitions to match the target shards.
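
For reference, this is roughly the kind of Spark job I have in mind. It is only an untested sketch: it reads the Hive table through SparkSession's Hive support and writes with saveToEs from elasticsearch-spark. The object name, the partition count of 9, and the reuse of the es.* settings from my table properties above are placeholders, not a working solution.

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object Hive2Es {
  def main(args: Array[String]): Unit = {
    // Carry over (a subset of) the es.* settings from the tblproperties above.
    val spark = SparkSession.builder()
      .appName("hive2es")
      .enableHiveSupport()
      .config("es.nodes", "172.21.1.31")
      .config("es.mapping.id", "elastic_0")
      .config("es.batch.size.bytes", "4mb")
      .getOrCreate()

    // Read the source Hive table and set the write parallelism explicitly;
    // each partition becomes one concurrent bulk writer against Elasticsearch.
    val df = spark.table("usercenter_dw.performance_es").repartition(9)

    // saveToEs comes from the org.elasticsearch.spark.sql implicits.
    df.saveToEs("hive2es/artists")

    spark.stop()
  }
}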

mapred.reduce.tasks is a deprecated property; have you tried mapreduce.job.reduces?
