Spark SQL thriftserver DDL?


(Andrew Wooster) #1

I have a working example using the Elasticsearch/Hive SQL connector. It allows me to use Tableau to query Elasticsearch. Here is my working Hive SQL DDL:

CREATE EXTERNAL TABLE biblio_raw (
    url   STRING,
    content_text_array ARRAY<STRING>
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.port' = '11100', 
  'es.resource' = 'my_index/doc', 
  'es.mapping.names' = 'content_text_array:contentText'
);

The Hive queries are painfully slow. I now want to create a similar view using Spark so that I can take advantage of the Elasticsearch/Spark push-down capabilities.

I start the Spark thrift server as follows:

$SPARK_HOME/sbin/start-thriftserver.sh --master spark://xxx.com:11407 --hiveconf hive.server2.thrift.port=11410 --jars ~/workspace/lib/elasticsearch-spark_2.10-2.2.0-rc1.jar

It is not clear to me what the equivalent Spark SQL DDL is to create the view above. I tried the following without success:

CREATE EXTERNAL TABLE biblio_raw (
    url   STRING,
    content_text_array ARRAY<STRING>
)
STORED BY 'org.elasticsearch.spark.sql'
TBLPROPERTIES(
  'es.port' = '11100', 
  'es.resource' = 'my_index/doc', 
  'es.mapping.names' = 'content_text_array:contentText'
);

It failed with the error: Cannot find class 'org.elasticsearch.spark.sql'


(Andrew Wooster) #2

I've figured out the corresponding Spark SQL syntax. The following temporary table works:

CREATE TEMPORARY TABLE biblio_raw (
    url   STRING,
    content_text_array ARRAY<STRING>
)
USING org.elasticsearch.spark.sql 
OPTIONS (
  path 'my_index/doc', 
  query '?q=mouse',
  es.port '11100',
  es.mapping.names 'content_text_array:contentText'
);

HOWEVER, es.mapping.names does NOT work. The DDL above does not map the content_text_array SQL column to the contentText ES field. How do I express mappings with Elasticsearch/Spark SQL?


(Costin Leau) #3

es.mapping.names is not a global configuration; it is available in Hive (mainly because Hive has various issues, including being case insensitive) but not in Spark. Since Spark preserves case, use the field's actual (mixed-case) name directly and you should be all set.
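If I read this correctly, the fix would look like the sketch below (untested, reusing the index and port from the thread): declare the column with the Elasticsearch field's actual casing and drop the Hive-only es.mapping.names option.

```sql
-- Sketch of the suggested fix: Spark SQL is case sensitive, so the column
-- can be named with the actual Elasticsearch field name (contentText),
-- making the Hive-only es.mapping.names option unnecessary.
CREATE TEMPORARY TABLE biblio_raw (
    url         STRING,
    contentText ARRAY<STRING>
)
USING org.elasticsearch.spark.sql
OPTIONS (
    path 'my_index/doc',
    es.port '11100'
);
```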

P.S. Note that the query param makes sense when reading. You can also use it when writing, however many find it confusing when they cannot read back the data they wrote (since the written data does not match the query bound to the table).
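To illustrate the point (a hypothetical sketch, with a made-up table name, building on the DDL earlier in the thread): a table declared with a query only ever reads back documents matching that query, regardless of what was written through it.

```sql
-- Hypothetical illustration: the query option is bound to the table
-- definition, so reads through this table are filtered by '?q=mouse'.
-- Rows written through it that do not match the query will not appear
-- in subsequent SELECTs.
CREATE TEMPORARY TABLE biblio_mouse (
    url         STRING,
    contentText ARRAY<STRING>
)
USING org.elasticsearch.spark.sql
OPTIONS (
    path 'my_index/doc',
    query '?q=mouse',
    es.port '11100'
);

-- Returns only documents matching the bound query:
SELECT url FROM biblio_mouse;
```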
