Performance of Hive queries


I'm seeing a huge difference in time when running queries against a very small Elasticsearch index, I wondered if anyone could shed light on it (and suggest how to improve it).

I have a single document in Elasticsearch:

curl -XPOST 'http://es-host:9200/viz/characters/' -d '{"name":"finbarr saunders"}'

I then create my Hive table:

CREATE EXTERNAL TABLE es_foo (name string) 
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe' 
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' 
TBLPROPERTIES ('es.nodes'='es-host','es.resource'='viz/characters');

A SELECT * is fairly fast:

0: jdbc:hive2://localhost:10000> select * from es_foo;
|       name        |
| finbarr saunders  |
1 row selected (6.475 seconds)

But a COUNT DISTINCT or named-column SELECT is awful:

0: jdbc:hive2://localhost:10000> select count(distinct name) from es_foo;
| _c0  |
| 1    |
1 row selected (102.79 seconds)

0: jdbc:hive2://localhost:10000> select name from es_foo;
|       name        |
| finbarr saunders  |
1 row selected (80.341 seconds)
0: jdbc:hive2://localhost:10000>

Any suggestions?

This is with elasticsearch-hadoop-2.0.2, CDH 5.


You can try using 2.1.0.Beta4 though I doubt it will make a big difference.
There are several things at play here:

  1. you are using the JDBC driver which will add some extra complexity - for example when selecting the column name; it should be faster than select * (and it is on the elasticsearch side) but gets more complicated on Hive part I suspect
  2. DISTINCT COUNT is likely to involve additional map/reduce jobs - you can tell this from the console.

You can diagnose things yourself by:
a. looking at the calls made by the connector to Elasticsearch. They should be more or less the same and have the same speed. When specifying the column, you should see the project applied.
b. do the same calls from outside JDBC - by using the Hive shell directly and see whether there's any difference

c. see the work Hive executes with the data. The first select is basically pure streaming and hence why its fast (still Hive add some overhead 4-5 seconds). The rest likely involve additional processing which means map/reduce jobs - try to see whether switching to Tez improves things from an execution perspective (do not that a lot of folks reported problems with Tez from high memory consumption to crashes).

Additionally, if you really want to use SQL, you can try potentially Spark SQL (we support that as well in 2.1).

Thanks for the feedback, I'll have a look into it.