I am trying to find the best way to read data from Elasticsearch (v5.1.1) through Apache Spark (v2.2.1). I am using the connector jar elasticsearch-spark-20_2.11-5.3.1.jar.
My question is mainly about reading array fields. My documents' schema is uniform within an index type, so I am trying to specify the schema explicitly while reading.
For example:
df_ES_Index = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "192.168.0.1:9200")
    .schema(schema_index)
    .load("index/index_type"))
Schema definitions:
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType, ArrayType)

schema_n_offset = StructType([
    StructField("length", IntegerType(), True),
    StructField("offset", StringType(), True)
])

schema_n_language = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", StringType(), True),
    StructField("field3", FloatType(), True),
    StructField("offsets", ArrayType(schema_n_offset), True)
])

schema_index = StructType([
    StructField("languages", ArrayType(schema_n_language), True)
])
Here, my first-level field "languages" is correctly recognized as an array in Spark. However, the field "offsets" within "languages" is read as a struct type, which results in this error:
Field 'languages.offsets' is backed by an array but the associated Spark Schema does not reflect this;
I know I can include/exclude fields to bypass this error (via "es.read.field.as.array.include"), for example as shown below.
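For reference, this is roughly what that workaround looks like on my read (a sketch; the exact field list "languages,languages.offsets" is my assumption based on the mapping above):

df_ES_Index = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "192.168.0.1:9200")
    # tell the connector which ES fields should be treated as arrays
    .option("es.read.field.as.array.include", "languages,languages.offsets")
    .load("index/index_type"))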
I thought I wouldn't need to go through that if I could specify the schema while reading data from Elasticsearch. Does anyone have a suggestion for reading nested/array fields from Elasticsearch through Spark?
Thanks!