Hello everyone, I am using Elasticsearch version 6.0.1 from the AWS service, and I am trying to read ES index data into a Spark DataFrame using PySpark.
I can read all fields except the ones that contain nested arrays; the nested array itself contains another nested array inside it.
The mapping in the index is below (after it I have sketched the Spark schema I would expect it to produce):
{
  "accounts": {
    "type": "nested",
    "properties": {
      "accountClassificationOne": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "alternateNames": {
        "type": "nested",
        "properties": {
          "createDate": {
            "properties": {
              "chronology": {
                "type": "object"
              },
              "millis": {
                "type": "long"
              }
            }
          },
          "inactive": {
            "type": "boolean"
          },
          "name": {
            "type": "text",
            "fields": {
              "autocomplete": {
                "type": "text",
                "analyzer": "customer_synonym_autocomplete",
                "search_analyzer": "customer_synonym"
              },
              "de": {
                "type": "text",
                "analyzer": "customer_german_autocomplete",
                "search_analyzer": "german"
              },
              "full": {
                "type": "text",
                "analyzer": "customer_synonym_full"
              },
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              },
              "normalize": {
                "type": "keyword",
                "normalizer": "lowercase_normalizer"
              }
            },
            "analyzer": "customer_synonym"
          }
        },
        "badDebt": {
          "type": "boolean"
        }
      }
    }
  }
}
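For clarity, the Spark schema I would expect the accounts mapping above to produce looks roughly like this (my own sketch, abbreviated to the fields relevant to the error, not output from the connector):

from pyspark.sql.types import (
    ArrayType, BooleanType, LongType, StringType, StructField, StructType
)

# Expected shape of the "accounts" column: an array of structs, where
# "alternateNames" is itself an array of structs inside each account.
expected_accounts_schema = ArrayType(StructType([
    StructField("accountClassificationOne", StringType()),
    StructField("alternateNames", ArrayType(StructType([
        StructField("createDate", StructType([
            StructField("chronology", StructType([])),  # "object" with no mapped sub-fields
            StructField("millis", LongType()),
        ])),
        StructField("inactive", BooleanType()),
        StructField("name", StringType()),
    ]))),
]))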
The read options in my PySpark code, and roughly how they are passed to the reader, are:
es_options_read = {
    "es.nodes": es_nodes,
    "es.port": "443",
    "es.resource": "index_name/type",
    "es.query": myquery,
    "es.nodes.wan.only": "true",
    "es.read.field.as.array.include": "accounts",
    "es.read.field.include": "accounts"
}
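This is roughly how I load the DataFrame with these options (es_nodes and myquery are defined earlier in my script):

from pyspark.sql import SparkSession

# Build the session and read from Elasticsearch with the options above.
spark = SparkSession.builder.appName("es-read").getOrCreate()

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_options_read)
    .load()
)
df.printSchema()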
The error is:
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'updateDate.chronology' not found; typically this occurs with arrays which are not mapped as single value
Sometimes I get another error instead: java.lang.NullPointerException.
I have tried multiple combinations of es.read.field.as.array.include, struct fields, explode, and many other read options, but with no luck; one of the attempts is sketched below. Could anyone help me with this?
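For example, one of the combinations I tried (listing both nested paths in es.read.field.as.array.include and then exploding the arrays) looks roughly like this, and it still fails with the same errors:

from pyspark.sql.functions import col, explode

# Mark both nested paths as arrays before reading.
es_options_read["es.read.field.as.array.include"] = "accounts,accounts.alternateNames"

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_options_read)
    .load()
)

# Flatten accounts, then alternateNames, to get one row per alternate name.
flattened = (
    df.select(explode(col("accounts")).alias("account"))
      .select(
          col("account.accountClassificationOne").alias("accountClassificationOne"),
          explode(col("account.alternateNames")).alias("alt"),
      )
      .select("accountClassificationOne", "alt.name", "alt.inactive")
)
flattened.show()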