Found duplicate column(s) in the data schema: need help loading such index data into a Spark DataFrame


(Yasmeen Chakrayapeta) #1

Hi Team,

I am trying to read data from an Elasticsearch index into a Spark DataFrame, but the index contains the same field name in different cases (upper/lower case).

Below is the mapping. The error I am getting is: pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) in the data schema: providercolumn;'

Can you please help me deal with this scenario?

{
  "INDEXNAME": {
    "mappings": {
      "Type": {
        "properties": {
          "Providercolumn": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "providercolumn": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
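
For anyone hitting the same error, here is a minimal sketch of how the colliding names in a mapping like the one above could be spotted before loading. The helper name case_collisions and the trimmed-down properties dict are illustrative assumptions, not part of any Elasticsearch or Spark API.

```python
from collections import defaultdict


def case_collisions(field_names):
    """Group field names that are identical once case is ignored.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one spelling exists.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Trimmed-down copy of the "properties" section from the mapping (assumed shape).
properties = {
    "Providercolumn": {"type": "text"},
    "providercolumn": {"type": "text"},
}

print(case_collisions(properties))
# → {'providercolumn': ['Providercolumn', 'providercolumn']}
```

Each reported group is a set of fields that a case-insensitive reader would treat as one column.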


(James Baiera) #2

It seems that Spark is not case sensitive when determining field names, so Providercolumn and providercolumn end up as the same column. It's most likely a good idea to rename one of these fields if possible, or to direct the connector to ignore one of them so that the read succeeds: es.read.field.exclude = providercolumn
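
A minimal PySpark sketch of the exclude approach, assuming the elasticsearch-hadoop connector is on the classpath and a node is reachable at the default localhost address. The helper names es_read_options and read_index are made up for illustration, and the resource string "INDEXNAME/Type" is simply taken from the mapping in the question.

```python
def es_read_options(resource, exclude_fields, nodes="localhost"):
    """Build connector options that skip the given fields on read.

    Hypothetical helper; the option keys are elasticsearch-hadoop settings.
    """
    return {
        "es.nodes": nodes,  # assumed host, adjust for your cluster
        "es.resource": resource,  # index/type to read from
        "es.read.field.exclude": ",".join(exclude_fields),
    }


def read_index(spark, resource, exclude_fields, nodes="localhost"):
    """Load an index into a DataFrame, leaving out the excluded fields."""
    options = es_read_options(resource, exclude_fields, nodes)
    return (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**options)
        .load()
    )


# Usage, given an existing SparkSession named `spark`:
# df = read_index(spark, "INDEXNAME/Type", ["providercolumn"])
# df.printSchema()
```

With the lowercase field excluded, only Providercolumn should remain in the schema. Spark also has a spark.sql.caseSensitive setting (false by default) that governs how column names are compared; whether enabling it is enough to avoid the error with this connector has not been verified here.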