Found duplicate column(s) in the data schema: need help loading such index data into a Spark DataFrame

Yasmeenc · February 7, 2019, 7:25pm

Hi Team,

I am trying to read data from an Elasticsearch index into a Spark DataFrame, but the index contains the same field name in different cases (upper/lower case).

The mapping is below. The error I am getting is pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) in the data schema: providercolumn;'

Can you please help me deal with this scenario?

{
  "INDEXNAME": {
    "mappings": {
      "Type": {
        "properties": {
          "Providercolumn": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "providercolumn": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
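
To make the collision concrete, here is a minimal sketch in plain Python that groups field names which become identical once case is ignored. The helper name `case_insensitive_duplicates` is hypothetical, and the field list is copied by hand from the "properties" section of the mapping above rather than fetched from Elasticsearch.

```python
from collections import defaultdict


def case_insensitive_duplicates(field_names):
    """Return groups of names that collide when compared case-insensitively."""
    groups = defaultdict(list)
    for name in field_names:
        # Names are bucketed by their lowercase form.
        groups[name.lower()].append(name)
    # Keep only the buckets that hold more than one original name.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Field names copied from the mapping in this thread.
fields = ["Providercolumn", "providercolumn"]
print(case_insensitive_duplicates(fields))
# {'providercolumn': ['Providercolumn', 'providercolumn']}
```

The lowercase key in the result matches the name reported in the error message, providercolumn.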

james.baiera · February 11, 2019, 9:50pm

It seems that Spark is not case sensitive when resolving field names, so the two fields collide. It is most likely a good idea to rename one of these columns if possible, or to direct Spark to ignore one of them so that the read succeeds: es.read.field.exclude = providercolumn
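
As a rough sketch of the exclusion approach in PySpark, the setting can be passed as a reader option. The helper names `es_read_options` and `load_index`, the node address localhost:9200, and the resource string INDEXNAME/Type are assumptions for illustration. The es-hadoop connector jar is assumed to be on the Spark classpath, and the read itself is left commented out because it needs a live cluster.

```python
def es_read_options(nodes, exclude_fields):
    """Build es-hadoop reader options that skip the given fields."""
    return {
        "es.nodes": nodes,  # assumed cluster address
        # es.read.field.exclude takes a comma-separated list of field names.
        "es.read.field.exclude": ",".join(exclude_fields),
    }


def load_index(spark, resource, options):
    """Read an Elasticsearch resource into a DataFrame via es-hadoop."""
    return (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**options)
        .load(resource)
    )


# Exclude the lowercase field so that only "Providercolumn" is read.
options = es_read_options("localhost:9200", ["providercolumn"])

# With an existing SparkSession named `spark` (not created here):
# df = load_index(spark, "INDEXNAME/Type", options)
# df.printSchema()
```

Spark also has a spark.sql.caseSensitive setting that controls whether names differing only by case are treated as distinct. Whether enabling it is enough for this index is not confirmed in this thread.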

