I'm seeing a huge difference in time when running queries against a very small Elasticsearch index, I wondered if anyone could shed light on it (and suggest how to improve it).
You can try using 2.1.0.Beta4 though I doubt it will make a big difference.
There are several things at play here:
you are using the JDBC driver which will add some extra complexity - for example when selecting the column name; it should be faster than select * (and it is on the elasticsearch side) but gets more complicated on Hive part I suspect
DISTINCT COUNT is likely to involve additional map/reduce jobs - you can tell this from the console.
You can diagnose things yourself by:
a. looking at the calls made by the connector to Elasticsearch. They should be more or less the same and have the same speed. When specifying the column, you should see the project applied.
b. do the same calls from outside JDBC - by using the Hive shell directly and see whether there's any difference
c. see the work Hive executes with the data. The first select is basically pure streaming and hence why its fast (still Hive add some overhead 4-5 seconds). The rest likely involve additional processing which means map/reduce jobs - try to see whether switching to Tez improves things from an execution perspective (do not that a lot of folks reported problems with Tez from high memory consumption to crashes).
Additionally, if you really want to use SQL, you can try potentially Spark SQL (we support that as well in 2.1).
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.