My Hive cluster has more than 30 nodes, and the table takes up almost 140 GB. Additionally, my Elasticsearch cluster (3 data nodes, each with 8 cores and 16 GB of memory) is isolated from the Hive cluster. I want to load data from Hive into ES following the Apache Hive integration documentation.
The following is my HiveQL script:
add jar elasticsearch-hadoop-5.2.2.jar;
drop table database_X.artists;
CREATE EXTERNAL TABLE database_X.artists(
user_id string,
province int,
...
col34 string) -- the table has 34 columns
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'dillon_pengcz/artists',
  'es.nodes' = '172.21.8.24',
  'es.index.auto.create' = 'true',
  'es.mapping.id' = 'caa_id',
  'es.batch.size.entries' = '0',
  'es.batch.size.bytes' = '4mb');
insert overwrite table database_X.artists select * from database_X.artists_src;
'172.21.8.24' is the IP address of my ES master node.
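While re-reading the elasticsearch-hadoop configuration docs, I noticed that es.nodes accepts a comma-separated list of nodes, and that my batch settings are well above the connector defaults (es.batch.size.entries defaults to 1000 and es.batch.size.bytes to 1mb). Here is an untested sketch of the properties I may try next; 172.21.8.34 shows up in my capture below, but .35/.36 are just placeholder addresses for my other two data nodes:

-- Untested sketch: list the data nodes explicitly and restore the default batch limits.
CREATE EXTERNAL TABLE database_X.artists(
  ... -- same 34 columns as above
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'dillon_pengcz/artists',
  'es.nodes' = '172.21.8.34,172.21.8.35,172.21.8.36', -- comma-separated node list
  'es.index.auto.create' = 'true',
  'es.mapping.id' = 'caa_id',
  'es.batch.size.entries' = '1000', -- connector default; my '0' disables the doc-count cap
  'es.batch.size.bytes' = '1mb');   -- connector default; my '4mb' makes each bulk much larger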
For the past few days I have not been able to execute the above script successfully, so I instead tested it with 100,000 records using a LIMIT, which did succeed:
insert overwrite table database_X.artists select * from database_X.artists_src limit 100000;
And I used
tcpflow -p -c -i eth1 port 9200
to find out what was happening on the wire. But what I found differs from my understanding: on my master node 172.21.8.24, I saw a great many POST /_bulk requests like the following:
172.021.008.024.56340-172.021.008.034.09200: POST /_bulk HTTP/1.1^M
User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.19.1 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2^M
Host: 172.21.8.34:9200^M
Accept: */*^M
Content-Length: 17500403^M
Content-Type: application/x-www-form-urlencoded^M
Expect: 100-continue^M
^M
172.021.008.034.09200-172.021.008.024.56340: HTTP/1.1 100 Continue^M
^M
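To see where the bulk requests actually land, I plan to watch the bulk thread pool on each node while the insert runs. This is just the stock ES 5.x _cat API, pointed at my master node:

# If only one node's 'completed' counter grows during the insert,
# all bulks are being handled by that single node.
curl -s 'http://172.21.8.24:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,completed'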