SparkSQL and ElasticSearch not inferring JSON Schema correctly, possible bugs?

Aris_V · February 11, 2015, 1:18am

I'm using ElasticSearch with elasticsearch-spark-BUILD-SNAPSHOT and
Spark/SparkSQL 1.2.0, on Costin Leau's advice.

I want to query ElasticSearch for a bunch of JSON documents from within
SparkSQL, and then use a SQL query to select a column, which is
actually a JSON key -- the normal thing that SparkSQL does using the
SQLContext.jsonFile(filePath) facility. The difference is that I am using the
ElasticSearch container.

The big problem: when I do something like

SELECT jsonKeyA from tempTable;

I actually get the WRONG KEY out of the JSON documents! I discovered that
if I have JSON keys physically in the order D, C, B, A in the JSON
documents, the ElasticSearch connector discovers those keys BUT then sorts
them alphabetically as A, B, C, D -- so when I SELECT A from tempTable, I
actually get column D (because the physical JSONs had key D in the first
position). This only happens when reading from ElasticSearch with SparkSQL.
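
To make the symptom concrete, here is a minimal pure-Python sketch of the
mismatch described above. It does not use Spark or the connector at all; the
names doc, schema, row, select_by_position and select_by_name are hypothetical
and made up for illustration. The assumption is that the schema is sorted by
field name while the row values stay in document order, which would produce
exactly this kind of wrong-column result.

```python
import json
from collections import OrderedDict

# A document whose keys appear physically in the order D, C, B, A.
raw = '{"D": "d-value", "C": "c-value", "B": "b-value", "A": "a-value"}'
doc = json.loads(raw, object_pairs_hook=OrderedDict)

# Assumption: the inferred schema lists the field names alphabetically...
schema = sorted(doc.keys())   # A, B, C, D
# ...while the row values are kept in document order.
row = list(doc.values())      # values of D, C, B, A


def select_by_position(column):
    """Resolve the column by its index in the sorted schema (the suspected bug)."""
    return row[schema.index(column)]


def select_by_name(column):
    """Resolve the column by key, independent of any ordering."""
    return doc[column]


wrong = select_by_position("A")   # picks the first value in the row
right = select_by_name("A")
print(wrong, right)
```

In this sketch the positional lookup for A returns the value stored under D,
while the lookup by name returns the value actually stored under A.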

It gets much worse: When a key is missing from one of the documents and
that key should be NULL, the whole application actually crashes and gives
me a java.lang.IndexOutOfBoundsException -- the schema that is inferred is
totally screwed up.

In the above example with physical JSONs containing keys in the order
D, C, B, A, if one of the JSON documents is missing the key/column I am
querying for, I get that java.lang.IndexOutOfBoundsException.
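
One way such an out-of-bounds error could arise, sketched in plain Python
rather than Spark: if the schema is the sorted union of all keys but a row only
holds the values its own document contains, a positional lookup can run past
the end of the row. The names schema, doc, row, select_by_position and
select_with_nulls are hypothetical, and this is only a guess at the mechanism,
not the connector's actual code.

```python
# Assumption: the schema is the sorted union of keys seen across all documents.
schema = ["A", "B", "C", "D"]

# This document has no key "D", so it yields only three values.
doc = {"C": "c-value", "B": "b-value", "A": "a-value"}
row = list(doc.values())


def select_by_position(column):
    """Positional lookup; the index can exceed the length of a short row."""
    return row[schema.index(column)]


def select_with_nulls(column):
    """Name-based lookup that yields None (NULL) for a missing key."""
    return doc.get(column)


try:
    select_by_position("D")
    positional_result = "no error"
except IndexError:
    # Python's analogue of java.lang.IndexOutOfBoundsException
    positional_result = "IndexError"

null_result = select_with_nulls("D")
print(positional_result, null_result)
```

The name-based lookup gives the NULL I would expect for the missing key, while
the positional lookup fails with an index error.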

I am using the BUILD-SNAPSHOT because otherwise I couldn't build the
elasticsearch-spark project; Costin said so.

Any clues here? Any fixes?

Aris

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d866e547-edf6-416f-92bb-8c61aac17d43%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Aris_V · February 11, 2015, 9:45pm

Costin Leau saw this on the Spark User Mailing List, and I have filed this
as a bug on GitHub:

github.com/elastic/elasticsearch-hadoop

Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

Opened 09:43PM - 11 Feb 15 UTC by aris-vlasakakis-ck
Closed 02:32PM - 28 Apr 15 UTC
Labels: invalid, question, :Spark, v2.1.0.Beta4

