My Hive cluster has more than 30 nodes, and the table takes almost 140 GB of space. My Elasticsearch cluster (3 data nodes, each with 8 cores and 16 GB of memory) is isolated from Hive. Now I want to load data from Hive into ES following the Apache Hive integration.
The following is my HiveQL script:
add jar elasticsearch-hadoop-5.2.2.jar;
list jar;
drop table usercenter_dw.performance_hive2es;
CREATE EXTERNAL TABLE `usercenter_dw.performance_hive2es`(
  `elastic_0` string COMMENT '',
  `elastic_1` int COMMENT '',
  ......
  `elastic_78` string COMMENT '',
  `elastic_79` string COMMENT '')
stored by 'org.elasticsearch.hadoop.hive.EsStorageHandler'
tblproperties(
  'es.resource' = 'hive2es/artists',
  'es.nodes' = '172.21.1.31',
  'es.index.auto.create' = 'true',
  'es.mapping.id' = 'elastic_0',
  'es.batch.size.entries' = '0',
  'es.batch.size.bytes' = '4mb');
set mapred.job.queue.name=eng;
set mapred.reduce.tasks=9;
insert overwrite table usercenter_dw.performance_hive2es
select * from usercenter_dw.performance_es limit 100000;
When I run the script from the beeline or hive prompt, I get the following messages:
INFO : The url to track the job: http://nn1.bitauto.dmp:8088/proxy/application_1489400630906_62460/
INFO : Starting Job = job_1489400630906_62460, Tracking URL = http://nn1.bitauto.dmp:8088/proxy/application_1489400630906_62460/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/bin/hadoop job -kill job_1489400630906_62460
INFO : Hadoop job information for Stage-0: number of mappers: 631; number of reducers: 1
INFO : 2017-04-07 09:57:13,237 Stage-0 map = 0%, reduce = 0%
INFO : 2017-04-07 09:57:43,177 Stage-0 map = 1%, reduce = 0%, Cumulative CPU 1230.13 sec
INFO : 2017-04-07 09:57:47,886 Stage-0 map = 2%, reduce = 0%, Cumulative CPU 2170.94 sec
From the above messages I see "number of mappers: 631; number of reducers: 1", and the data only starts flowing into my Elasticsearch cluster once the job reaches map = 100%, reduce = 67%, as shown below:
INFO : 2017-04-07 10:15:54,103 Stage-0 map = 100%, reduce = 67%, Cumulative CPU 35014.57 sec
INFO : 2017-04-07 10:16:54,503 Stage-0 map = 100%, reduce = 67%, Cumulative CPU 35057.39 sec
INFO : 2017-04-07 10:17:38,336 Stage-0 map = 100%, reduce = 67%, Cumulative CPU 35089.77 sec
INFO : 2017-04-07 10:18:38,842 Stage-0 map = 100%, reduce = 67%, Cumulative CPU 35124.49 sec
I think the single reduce task is what makes the load so slow (almost one hour). My `set mapred.reduce.tasks=9;` obviously did not take effect, and in my experience there is no other way to adjust the number of reduce tasks.
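For reference, if I understand the MapReduce 2 property names correctly, `mapred.reduce.tasks` is deprecated and the current name is `mapreduce.job.reduces`, so perhaps that is the setting the job actually reads (this is an assumption on my part). I also suspect the global `limit 100000` may itself force everything through one reducer regardless of the setting:

```sql
-- assumed modern property name replacing the deprecated mapred.reduce.tasks
set mapreduce.job.reduces=9;
```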
So I want to know: should I instead use elasticsearch-spark-20_2.11-5.2.2.jar to build a Spark application, where I can set the number of partitions to match the shards?
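If the Spark route is the right one, here is a minimal sketch of what I have in mind, assuming elasticsearch-spark-20_2.11-5.2.2.jar is on the classpath; the application name and the partition count of 9 are my own placeholders, and the ES settings are copied from the Hive table above:

```scala
import org.apache.spark.sql.SparkSession
// brings the implicit saveToEs method onto DataFrames
import org.elasticsearch.spark.sql._

object Hive2Es {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive2es")                        // placeholder name
      .enableHiveSupport()
      .config("es.nodes", "172.21.1.31")
      .config("es.index.auto.create", "true")
      .config("es.mapping.id", "elastic_0")
      .getOrCreate()

    val df = spark.sql(
      "select * from usercenter_dw.performance_es limit 100000")

    // repartition so several tasks write to ES in parallel;
    // 9 is a guess to tune against the index's shard count
    df.repartition(9).saveToEs("hive2es/artists")

    spark.stop()
  }
}
```

My understanding is that each partition becomes a Spark task writing to ES in parallel, so the partition count, rather than a single Hive reducer, would control the indexing parallelism.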