How is the ES schema determined when reading with ES-Hadoop?

(Sonny Heer) #1

Using this jar: elasticsearch-spark-13_2.10-5.1.1.jar

"org.elasticsearch.spark.sql")
    .option("es.nodes", es_url)
    .option("es.port", "443")
    .option("es.nodes.wan.only", "true")
    .option("", "true")
    .option("", array_with_comma)
    .option("", "false")
    .option("", exclude_with_comma)
    .option("", "")
    .option("pushdown", "true")
    .load(es_index)

We are not passing any arguments except the exclude fields, which we use to exclude a couple of top-level fields.

The problem is that some fields are missing from the output of dataframe.printSchema().

Does it use _mapping to figure out the schema or sampling? I didn't find any docs on this.


(James Baiera) #2

ES-Hadoop uses the mapping endpoint for the resource given, though in the 5.x line there is a bug when attempting to read from multiple indices or types: when the schema is discovered at the start of the process, only one mapping is picked up and used for the fields. I would check that you are reading from only one index and type or, if you are reading from multiple, that their mappings are identical across the board.
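
To check whether the mappings are identical across the board, one option is to fetch the mapping for the resource (GET <index>/_mapping) and diff the fields per index and type. The snippet below is only a rough sketch: the helper names and the sample index names ("logs-a", "logs-b", type "event") are made up for illustration, and it assumes the 5.x response shape of index -> "mappings" -> type -> "properties".

```python
def flatten_properties(properties, prefix=""):
    """Return {dotted_field_name: type} for a mapping 'properties' dict."""
    fields = {}
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object or nested field: recurse into its sub-fields.
            fields.update(flatten_properties(spec["properties"], path + "."))
        else:
            fields[path] = spec.get("type", "object")
    return fields


def mapping_differences(mapping_response):
    """Return fields whose presence or type differs between index/type pairs.

    mapping_response is the parsed JSON body of GET <indices>/_mapping.
    """
    per_source = {}
    for index, body in mapping_response.items():
        for type_name, type_mapping in body.get("mappings", {}).items():
            per_source[(index, type_name)] = flatten_properties(
                type_mapping.get("properties", {})
            )

    all_fields = set()
    for fields in per_source.values():
        all_fields.update(fields)

    differences = {}
    for field in sorted(all_fields):
        seen = {source: fields.get(field) for source, fields in per_source.items()}
        if len(set(seen.values())) > 1:
            differences[field] = seen  # None means the field is absent there
    return differences


# Hypothetical sample response for two indices with one type each.
sample = {
    "logs-a": {"mappings": {"event": {"properties": {
        "msg": {"type": "text"},
        "user": {"properties": {"id": {"type": "keyword"}}},
    }}}},
    "logs-b": {"mappings": {"event": {"properties": {
        "msg": {"type": "text"},
        "user": {"properties": {"id": {"type": "keyword"}}},
        "extra": {"type": "long"},
    }}}},
}

diffs = mapping_differences(sample)
for field, seen in diffs.items():
    print(field, seen)
```

An empty result means every index/type pair exposes the same fields with the same types. Any field listed is a candidate for going missing from the DataFrame schema if only one of the mappings is picked up.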

(Sonny Heer) #3

Thanks James! That helps. It appears the mapping isn't being updated when data is added by the other team, which is causing our issue. Thanks again for confirming.
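
For anyone hitting the same thing: a quick way to confirm it is to compare the field paths in a sample document's _source with the fields in the mapping. This is a minimal sketch, not anything ES-Hadoop provides. The helper names and the sample document are hypothetical, and it assumes you have already fetched the type's "properties" dict and one document yourself.

```python
def document_paths(doc, prefix=""):
    """Collect dotted leaf field paths from a document's _source."""
    paths = set()
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        nested = [v for v in values if isinstance(v, dict)]
        if nested:
            # Object (or array of objects): descend into sub-fields.
            for item in nested:
                paths |= document_paths(item, path + ".")
        else:
            paths.add(path)
    return paths


def mapped_paths(properties, prefix=""):
    """Collect dotted leaf field paths from a mapping 'properties' dict."""
    paths = set()
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            paths |= mapped_paths(spec["properties"], path + ".")
        else:
            paths.add(path)
    return paths


def unmapped_fields(doc, properties):
    """Return fields present in the document but absent from the mapping."""
    return sorted(document_paths(doc) - mapped_paths(properties))


# Hypothetical mapping and document, for illustration only.
properties = {
    "msg": {"type": "text"},
    "user": {"properties": {"id": {"type": "keyword"}}},
}
source = {"msg": "hi", "user": {"id": "u1", "team": "x"}, "tags": ["a"]}

missing = unmapped_fields(source, properties)
print(missing)
```

Any field reported here exists in the data but not in the mapping, so a schema built from the mapping would not include it.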