Performance of Hive queries

rmoff · May 19, 2015, 2:26pm

Hi,

I'm seeing a huge difference in time when running queries against a very small Elasticsearch index, I wondered if anyone could shed light on it (and suggest how to improve it).

I have a single document in Elasticsearch:

curl -XPOST 'http://es-host:9200/viz/characters/' -d '{"name":"finbarr saunders"}'

I then create my Hive table:

CREATE EXTERNAL TABLE es_foo (name string) 
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe' 
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
TBLPROPERTIES ('es.nodes'='es-host','es.resource'='viz/characters');

A SELECT * is fairly fast:

0: jdbc:hive2://localhost:10000> select * from es_foo;
+-------------------+
|       name        |
+-------------------+
| finbarr saunders  |
+-------------------+
1 row selected (6.475 seconds)

But a COUNT DISTINCT or named-column SELECT is awful:

0: jdbc:hive2://localhost:10000> select count(distinct name) from es_foo;
+------+
| _c0  |
+------+
| 1    |
+------+
1 row selected (102.79 seconds)

0: jdbc:hive2://localhost:10000> select name from es_foo;
+-------------------+
|       name        |
+-------------------+
| finbarr saunders  |
+-------------------+
1 row selected (80.341 seconds)
0: jdbc:hive2://localhost:10000>

Any suggestions?

This is with elasticsearch-hadoop-2.0.2, CDH 5.

thanks.

costin · May 20, 2015, 9:07am

You can try using 2.1.0.Beta4 though I doubt it will make a big difference.
There are several things at play here:

you are using the JDBC driver which will add some extra complexity - for example when selecting the column name; it should be faster than select * (and it is on the elasticsearch side) but gets more complicated on Hive part I suspect
DISTINCT COUNT is likely to involve additional map/reduce jobs - you can tell this from the console.

You can diagnose things yourself by:
a. looking at the calls made by the connector to Elasticsearch. They should be more or less the same and have the same speed. When specifying the column, you should see the project applied.
b. do the same calls from outside JDBC - by using the Hive shell directly and see whether there's any difference

c. see the work Hive executes with the data. The first select is basically pure streaming and hence why its fast (still Hive add some overhead 4-5 seconds). The rest likely involve additional processing which means map/reduce jobs - try to see whether switching to Tez improves things from an execution perspective (do not that a lot of folks reported problems with Tez from high memory consumption to crashes).

Additionally, if you really want to use SQL, you can try potentially Spark SQL (we support that as well in 2.1).

rmoff · May 21, 2015, 9:00am

Thanks for the feedback, I'll have a look into it.

Topic		Replies	Views
Hive external table performance issue Elasticsearch es-hadoop	3	1701	August 24, 2018
ES-Hadoop 5.1.1 performance Elasticsearch es-hadoop	3	916	February 14, 2017
Hive queries to read ES data taking too long \| Need suggestions to improve Elasticsearch es-hadoop	4	2252	July 6, 2017
How to get a better performance to load ElasticSearch data into Hive? Elasticsearch es-hadoop	1	399	February 22, 2021
Hive read es data slow Elasticsearch es-hadoop	5	1145	December 20, 2019

Performance of Hive queries

Related topics