Hive reads ES data slowly

Hive version: 1.2.1
ES version: 5.5.0

ES index: data_monthly (10 shards)

This is the HiveQL table definition:

id string,
uid string,
wb_name string,
platform string,
comment_count int,
fetch_time timestamp,
play_count int,
favorite_count int,
repost_count int,
monthly_net_inc_favorite_count int,
monthly_net_inc_play_count int,
monthly_net_inc_comment_count int,
release_time timestamp
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
'es.nodes' = ',',
'' = 'false',
'es.resource' = 'data_monthly',
'' = 'true',
'es.mapping.names' = 'id:_metadata._id, uid:UID');

I am reading data from ES into Hive.

Why does an index with 10 shards start only 2 map tasks?
Why is the read so slow?

Please help me and give me some advice. Thanks.

Hi @yousanghz.

Is your data stored in ES?

If the data lives in ES, the read will take more time because it is limited by network bandwidth.

It is better to use a DSL query to search the ES data.
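One way to act on the DSL suggestion from Hive is the ES-Hadoop `es.query` setting, which holds the query used when reading from `es.resource`. The following is only a sketch: the table name `data_monthly_filtered`, the node address `es-host:9200`, and the `play_count` filter are assumptions made for illustration and do not come from the original post.

```sql
-- Hypothetical table name, node address and filter. Replace them with real values.
-- The es.query value is Query DSL, so only matching documents are sent to Hive.
CREATE EXTERNAL TABLE data_monthly_filtered (
  uid string,
  play_count int,
  release_time timestamp
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = 'es-host:9200',
  'es.resource' = 'data_monthly',
  'es.query' = '{"query": {"range": {"play_count": {"gte": 1000}}}}',
  'es.mapping.names' = 'uid:UID'
);
```

Filtering on the ES side should reduce the amount of data that crosses the network, which is the bottleneck described above. It does not by itself change how many map tasks are started.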


Thank you for your reply, I am very happy to receive a reply.

Please forgive me for my bad English.

My data is indeed stored in ES, but why do 10 shards result in only 2 map tasks?

I also tried an index with 20 shards, but it still started only 2 map tasks.

I saw that the official documentation says: 'In short, roughly speaking more input splits means more tasks that can read at the same time, different parts of the source. More shards means more buckets from which to read an index content (at the same time).'

Did I understand it right?

Thank you again!
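The documentation quoted above ties read parallelism to the number of input splits. As a hedged sketch of one documented knob: ES-Hadoop 5.x has an `es.input.max.docs.per.partition` setting that, on clusters supporting scroll slicing (Elasticsearch 5.0 and above), advises the connector on the maximum number of documents per input partition. The table name `data_monthly_sliced`, the node address `es-host:9200`, and the value `50000` are assumptions chosen for illustration.

```sql
-- Hypothetical table name, node address and partition size. Replace them with real values.
-- A smaller es.input.max.docs.per.partition asks the connector for more input partitions.
CREATE EXTERNAL TABLE data_monthly_sliced (
  uid string,
  play_count int
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes' = 'es-host:9200',
  'es.resource' = 'data_monthly',
  'es.input.max.docs.per.partition' = '50000',
  'es.mapping.names' = 'uid:UID'
);
```

Whether this changes the 2 map tasks seen here is not established in this thread. The number of map tasks that actually run can also depend on Hive and cluster configuration.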

Hi @yousanghz.

Thank you!

Regarding your reply:

You: more splits means more tasks. Me: splits and shards are the same concept in ES.

You: more shards means more buckets. Me: the bucket concept is completely different from shards.

You: I tried a 20-shard index, but only 2 map tasks started. Me: what is "map" here? Do you get 2 map tasks when you run the query from Hive?

Your ES cluster determines the number of shards and the number of cluster nodes.

Finally, you are applying Hadoop concepts to ES.

Correct me if I am wrong!

