Spark SQL to query array fields in Elasticsearch


(Tridib) #1

I have integrated Elasticsearch with Spark using the elasticsearch-spark connector, and I can query Elasticsearch through Spark.

But the query does not return any results when I put an array field in the where clause.
The Elasticsearch document looks like the one below.
{
    "_index": "ppm",
    "_type": "docs",
    "_id": "AU-5lX7CCHEkhOBzqUsa",
    "_version": 1,
    "_score": 1,
    "_source": {
        "member_id": 1,
        "year": "2015",
        "r": [
            "R1",
            "R3"
        ]
    }
}
SQL used: select member_id from table_xxx where r = 'R1'
No result returned.

I tried the Hive function array_contains(r, 'R1'), but it says "expected array, found string". It looks like the array is being exposed as a string type.
Is there a different way to query array fields?
How can I see the native Elasticsearch JSON query that corresponds to the SQL?

Thanks
-Tridib


(eliasah) #2

Can you elaborate on how you are querying Elasticsearch through Spark?


(Tridib) #3

I register the ES index "ppm" as "table_xxx", then I query "table_xxx" with standard Spark SQL. This setup works for other simple conditions in the where clause, but it does not work for the array type.


(eliasah) #4

Once the index is loaded into Spark, querying it is outside the scope of Elasticsearch queries. To query the table you have created, you should consider using functions from the Spark SQL API.


(Tridib) #5

As I mentioned in my original post, the Spark SQL query "array_contains(r, 'R1')" did not work with Elasticsearch. According to the elastic/hadoop connector, this should work.


(eliasah) #6

Can you share the reference (documentation) where it says that this should work?


(Costin Leau) #7

This is a known issue stemming from the fact that ES doesn't treat arrays differently from single values. This means that when reading a mapping, one doesn't know whether a certain field is an array until after reading all the values (as an array can appear at any point in time).
Typically this is not a problem. However, since Spark SQL requires the schema to be known beforehand and to be fixed, it causes issues when the underlying format changes.
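To illustrate the point, here is a toy model in plain Python. It is not connector code: the mapping, the documents, the helper names and the array_fields hint are all made up for the example. It only shows that a schema derived from the mapping alone types "r" as a string, which is why an array function is rejected, and that some external hint is needed to type it as an array.

```python
# Toy model (not connector code): why a mapping alone cannot reveal arrays.

# The mapping declares "r" as a plain string. Nothing in it says whether a
# document holds one value or many.
mapping = {"member_id": "long", "year": "string", "r": "string"}

# Both documents are valid under that same mapping.
docs = [
    {"member_id": 1, "year": "2015", "r": ["R1", "R3"]},  # array of strings
    {"member_id": 2, "year": "2015", "r": "R2"},          # single string
]


def schema_from_mapping(mapping, array_fields=()):
    """Build a fixed schema up front, as Spark SQL requires.

    array_fields is a hypothetical hint naming fields to treat as arrays.
    """
    return {
        field: f"array<{es_type}>" if field in array_fields else es_type
        for field, es_type in mapping.items()
    }


def array_contains(schema, doc, field, value):
    """Mimic the type check from the thread, then compare values."""
    if not schema[field].startswith("array<"):
        raise TypeError(f"expected array, found {schema[field]}")
    raw = doc[field]
    values = raw if isinstance(raw, list) else [raw]  # wrap single values
    return value in values


naive = schema_from_mapping(mapping)                        # "r" is a string
hinted = schema_from_mapping(mapping, array_fields=("r",))  # "r" is an array
```

With the naive schema, array_contains(naive, docs[0], "r", "R1") raises a TypeError, much like the message reported in the first post. With the hinted schema, the same call succeeds for both documents.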
An issue has been raised to address this, and the fix should hit master soon.
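For readers on newer versions: later elasticsearch-hadoop releases document a read setting, es.read.field.as.array.include, that declares which fields should be treated as arrays. Whether that setting is the fix referred to here is an assumption. The sketch below shows how it might be used from PySpark. It assumes Spark 2.x or later, a connector version that supports the setting, and a node on localhost. The helper names es_read_options and query_members are hypothetical.

```python
def es_read_options(nodes, array_fields):
    """Build connector options that mark the given fields as arrays."""
    return {
        "es.nodes": nodes,  # assumption: cluster address
        "es.read.field.as.array.include": ",".join(array_fields),
    }


def query_members(spark, resource="ppm/docs", nodes="localhost"):
    """Load the index as a DataFrame and filter on the array field.

    `spark` is an existing SparkSession with the connector on its classpath.
    """
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**es_read_options(nodes, ["r"]))
        .load(resource)
    )
    df.createOrReplaceTempView("table_xxx")
    return spark.sql(
        "select member_id from table_xxx where array_contains(r, 'R1')"
    )
```

Calling df.printSchema() before registering the view is a quick way to check whether "r" is typed as an array or as a string.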


(Tridib) #8

From the elasticsearch-hadoop docs and Costin's presentation, I got the impression that all of Spark SQL would work.


(Tridib) #9

Thanks for your response. I will wait for the fix.

