ES-Spark : issue with 'read_field_include' in case of nested objects


(Preeti Raj - Buchhada) #1

A typical record in my ES index looks like:

"_source": {
      "app": "panoply",
      "response": {
         "category": "uncategorized",
         "subcategory": "uncategorized",
         "activity_common_name": "name123",
         "score": 0,
         "duration_secs": 2,
         "sub_activity": null,
         "activity": "name123"
      },
      "member_id": 2357919,
      "device_user_identity": 1688734,
      "activity_type": "type123",
      "response_timestamp": "2016-01-10T23:05:18.000Z"
   }

I created a temporary table in the Spark shell as follows:

sql("""
      CREATE TEMPORARY TABLE jan10
      USING org.elasticsearch.spark.sql
      OPTIONS (
        resource 'cortez/data',
        nodes 'localhost',
        port '9201',
        scroll_size '500',
        query '?response_timestamp:[2016-01-01 TO 2016-01-10]',
        read_field_include 'member_id,response.category,response.subcategory,response.activity,response.activity_common_name,response.duration_secs,response.sub_activity,response_timestamp'
      ) """)

**Note:** response.score is not included in 'read_field_include'.

and executed

sql("""SELECT * from jan10""").show()

I observed that the values of all fields that come after response.score in the record (namely duration_secs, sub_activity and activity) show up as null.
If I add response.score to 'read_field_include', all field values are fetched correctly.
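
For reference, the workaround amounts to roughly the following definition. This is only a sketch: it assumes the same index, node and port as the table definition earlier in this post, the table name jan10_with_score is made up for illustration, and the only intended difference is that response.score is added to the include list.

```scala
// Run inside spark-shell (Spark 1.6) with the elasticsearch-hadoop jar on the classpath.
// Same options as the original table; only read_field_include differs,
// because it now lists response.score as well.
// The table name jan10_with_score is hypothetical.
sqlContext.sql("""
      CREATE TEMPORARY TABLE jan10_with_score
      USING org.elasticsearch.spark.sql
      OPTIONS (
        resource 'cortez/data',
        nodes 'localhost',
        port '9201',
        scroll_size '500',
        query '?response_timestamp:[2016-01-01 TO 2016-01-10]',
        read_field_include 'member_id,response.category,response.subcategory,response.activity,response.activity_common_name,response.score,response.duration_secs,response.sub_activity,response_timestamp'
      ) """)

// With response.score included, the nested fields come back populated.
sqlContext.sql("""SELECT * from jan10_with_score""").show()
```

The extra score column can be dropped afterwards by selecting only the needed columns instead of using SELECT *.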

This seems like a bug.
Could you please check?
Thanks.


(Costin Leau) #2

It looks like a bug, probably one that causes nested fields to be skipped.
What version of ES-Hadoop are you using?

Cheers,


(Preeti Raj - Buchhada) #3

Environment:
ES: 1.3.2
es-hadoop: elasticsearch-hadoop-2.3.0
Spark: spark-1.6.1-bin-hadoop2.6

