Best practice elasticsearch index schema for Spark SQL


(Thomas Decaux) #1

Hello,

I am using Elasticsearch with Spark SQL in order to query my data from Tableau. It works for very simple index structures, but as soon as I add nested fields I always get an exception such as:

Field 'product' not found; typically this occurs with arrays which are not mapped as single value

Hence my question: is there a best practice for defining the data schema (such as avoiding nested arrays, maybe)? And what exactly does this kind of error mean?

I am using Spark 1.5.2 with ES 2.1 and Hue notebooks:

CREATE TEMPORARY TABLE events_all USING org.elasticsearch.spark.sql OPTIONS (nodes "elasticsearch", path "events/events", read.field.include "event.*");

SELECT COUNT(*) FROM events_all

Will error:

org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'product' not found; typically this occurs with arrays which are not mapped as single value
at org.elasticsearch.spark.sql.RowValueReader$class.rowColumns(RowValueReader.scala:33)
at org.elasticsearch.spark.sql.ScalaRowValueReader.rowColumns(ScalaEsRowValueReader.scala:13)
at org.elasticsearch.spark.sql.ScalaRowValueReader.createMap(ScalaEsRowValueReader.scala:49)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:645)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:588)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:661)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:588)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:383)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:318)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:213)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:186)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:438)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:86)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)


(Costin Leau) #2

Hi,

Support for field arrays was introduced two milestone versions ago, in ES-Hadoop 2.2, and as of RC1 it is also documented. Can you please review this section of the docs and report back if the issue persists?
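For reference, a minimal sketch of how that setting could be applied to the table definition above, assuming the multi-valued field is `event.product` (that field name is a guess based on the error message, not something from the docs):

```sql
-- Sketch: declare which fields the connector should read as arrays,
-- using the es.read.field.as.array.include option (available since ES-Hadoop 2.2).
-- In Spark SQL OPTIONS the "es." prefix can be dropped, as with the other options here.
CREATE TEMPORARY TABLE events_all
USING org.elasticsearch.spark.sql
OPTIONS (
  nodes "elasticsearch",
  path "events/events",
  read.field.include "event.*",
  read.field.as.array.include "event.product"  -- assumed field name
);
```

With that hint the connector maps the field as an array type instead of a single value, which is what the `Field 'product' not found` error is complaining about.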

Thanks,


(system) #3