Reading from ES using Spark: colon in field name breaks es.read.field.as.array.include

We ran into various issues with empty arrays etc., but after reading through the forums we got past those. Currently, when using "es.read.field.as.array.include" and the field happens to have a colon in it (e.g. a:b), we hit a NumberFormatException. The ES fields look like this:

  "foo:bar": [
    
  ],
  "foo:bazz": [
    
  ],

and setting the option like so:

.option("es.read.field.as.array.include","foo:bar,foo:bazz")

Loading then fails with this exception:

: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Invalid parameter [foo:bar] specified in setting [foo:bar,foo:bazz]
at org.elasticsearch.hadoop.util.SettingsUtils.getFieldArrayFilterInclude(SettingsUtils.java:217)
at org.elasticsearch.spark.sql.SchemaUtils$.convertToStruct(SchemaUtils.scala:126)
at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:127)
at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:127)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:131)
at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:131)
at scala.Option.getOrElse(Option.scala:120)
at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:131)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "bar"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.elasticsearch.hadoop.util.SettingsUtils.getFieldArrayFilterInclude(SettingsUtils.java:212)
... 22 more
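
For what it's worth, the parse failure lines up with the name:depth form this setting supports: ES-Hadoop treats whatever follows the colon as an array depth and runs it through Integer.parseInt (visible in the trace above), so "foo:bar" gets read as field "foo" with depth "bar". The intended use of the colon looks like this (reusing the placeholder setup from the sketch above):

  # Documented "name:depth" form: treat field "foo" as a 2-level nested array.
  # A colon inside a real field name collides with this syntax.
  df2 = (sqlContext.read
         .format("org.elasticsearch.spark.sql")
         .option("es.read.field.as.array.include", "foo:2")
         .load("my-index/my-type"))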


Versions:

elasticsearch-spark-13_2.10-5.1.1.jar
Spark 1.6
HDP 2.4 stack

Thanks

I was actually able to get around this by using a wildcard, e.g. .option("es.read.field.as.array.include","foo*")

This worked in our case for now...

I'm still curious how to handle a colon in the field name, though. E.g. assume I have one field that is an array and another that isn't, but the prefix is the same; "foo*" wouldn't work in that case...
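
One untested thought for that case (I haven't checked how the include patterns are matched in 5.1.1): if a * is honored mid-pattern rather than only as a suffix, patterns like foo*bar could each single out one field without ever writing the literal colon:

  # Hypothetical: "foo*bar" / "foo*bazz" would each match exactly one field
  # while avoiding the colon -- this assumes mid-string wildcards are
  # supported by the include filter, which I have not confirmed.
  df3 = (sqlContext.read
         .format("org.elasticsearch.spark.sql")
         .option("es.read.field.as.array.include", "foo*bar,foo*bazz")
         .load("my-index/my-type"))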
