SparkSQL and Elasticsearch not inferring JSON schema correctly, possible bugs?

I'm using Elasticsearch with elasticsearch-spark-BUILD-SNAPSHOT and
Spark/SparkSQL 1.2.0, on Costin Leau's advice.

I want to query Elasticsearch for a bunch of JSON documents from within
SparkSQL, and then use a SQL query to select a column, which is actually
a JSON key -- the normal thing SparkSQL does with the
SQLContext.jsonFile(filePath) facility. The difference is that I am using
the Elasticsearch connector.

The big problem: when I do something like

SELECT jsonKeyA from tempTable;

I actually get the WRONG KEY out of the JSON documents! I discovered that
if the JSON keys are physically in the order D, C, B, A in the JSON
documents, the Elasticsearch connector discovers those keys BUT then sorts
them alphabetically as A, B, C, D - so when I SELECT A from tempTable, I
actually get column D (because the physical JSONs had key D in the first
position). This only happens when reading from Elasticsearch with SparkSQL.
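
A minimal sketch of the symptom in plain Python, for anyone trying to
picture it. This is NOT the connector's code, and the document contents and
function names are made up for illustration. It only assumes what is
described above: row values stay in physical document order while the
schema's field names are sorted alphabetically, and columns are then
resolved by position.

```python
# Hypothetical document with keys physically ordered D, C, B, A.
doc = {"D": "d-value", "C": "c-value", "B": "b-value", "A": "a-value"}

# Row values kept in physical document order.
row = list(doc.values())

# Field names sorted alphabetically, as the inferred schema appears to be.
schema = sorted(doc.keys())


def select_by_position(column):
    """Find the column's index in the schema, then index into the row."""
    return row[schema.index(column)]


def select_by_name(column):
    """Look the column up by key, independent of any ordering."""
    return doc[column]


wrong = select_by_position("A")  # index 0 of the row, which holds D's value
right = select_by_name("A")      # the value actually stored under key A
```

If the schema and the rows were ordered the same way, both lookups would
agree; the mismatch only shows up when one side is sorted and the other is
not.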

It gets much worse: when a key is missing from one of the documents and
its value should be NULL, the whole application crashes with a
java.lang.IndexOutOfBoundsException -- the schema that is inferred is
totally screwed up.

In the above example, with physical JSONs containing keys in the order
D, C, B, A, if one of the JSON documents is missing the key/column I am
querying for, I get that java.lang.IndexOutOfBoundsException.
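
A second plain-Python sketch, again with made-up data and helper names and
not taken from the connector, shows one way a missing key can produce an
out-of-bounds error: if rows are built from whatever values a document has,
a document with fewer keys yields a row shorter than the schema. Python's
IndexError stands in here for java.lang.IndexOutOfBoundsException. In this
simplified model the failure occurs when the selected column sits at the
last schema position; the real connector may fail differently. Aligning
each row to the schema by name and filling absent keys with None (NULL)
avoids the problem.

```python
# Hypothetical data: the second document lacks key "A".
docs = [
    {"D": 4, "C": 3, "B": 2, "A": 1},
    {"D": 40, "C": 30, "B": 20},
]

# Union of all keys, sorted, standing in for the inferred schema.
schema = sorted({key for doc in docs for key in doc})


def positional_row(doc):
    """Values in document order; shorter than the schema if a key is absent."""
    return list(doc.values())


def padded_row(doc):
    """Values aligned to the schema by name, with None for absent keys."""
    return [doc.get(name) for name in schema]


def select(rows, column):
    """Pick one column out of every row by its schema position."""
    index = schema.index(column)
    return [row[index] for row in rows]


try:
    select([positional_row(d) for d in docs], "D")
    failed = False
except IndexError:
    # The three-value row has no element at the schema's fourth position.
    failed = True

# With name-aligned rows the missing key simply comes back as None.
by_name = select([padded_row(d) for d in docs], "A")
```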

I am using the BUILD-SNAPSHOT because otherwise I couldn't build the
elasticsearch-spark project, as Costin said.

Any clues here? Any fixes?

Aris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d866e547-edf6-416f-92bb-8c61aac17d43%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Costin Leau saw this on the Spark User Mailing List, and I have filed this
as a bug on GitHub:

On Tuesday, February 10, 2015 at 5:18:57 PM UTC-8, Aris V wrote:
