Pyspark - read data from elasticsearch cluster on EMR

ruxiz · November 28, 2018, 4:44am

I am trying to read data from Elasticsearch with PySpark, using the elasticsearch-hadoop connector in Spark. The ES cluster sits on AWS EMR and requires credentials to sign in. My script is below:

from pyspark import SparkContext, SparkConf

# Stop the context that the PySpark shell predefines as `sc`,
# so a new one can be created with our own configuration.
sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

# Settings passed to the elasticsearch-hadoop connector.
es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    # "es.query": '{ "query": { "match_all": {} } }',
    # "es.input.json": "true",
    "es.net.http.auth.user": "aws_access_key",
    "es.net.http.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true",
}

# Read the index as an RDD of (key, document) pairs.
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf,
)

PySpark keeps throwing this error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.

: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index] failed; server[node:443] returned [403|Forbidden:]

ruxiz · November 28, 2018, 4:47am

I checked everything, and it all makes sense except for the user and pass entries. Would an AWS access key and secret key work here? We don't want to use the console username and password, for security reasons. Please advise! Thanks!

ruxiz · November 28, 2018, 5:40pm

It looks like this client does not support AWS authentication, and I need a custom AWS connector to connect to the AWS EMR cluster. Do you have any suggestions on how I should do it?

james.baiera · December 12, 2018, 7:48pm

I am not too familiar with integrating with technologies in AWS, since we do not test the connector against those environments at all. I will say that the es.net.http.auth.user and es.net.http.auth.pass settings are combined and sent as a Basic HTTP authentication header, Base64-encoded. If you need alternative headers to be sent, I would specify them in the configuration: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#_setting_http_request_headers
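To make the two points in the reply above concrete, here is a minimal sketch in Python. It assumes the `es.net.http.header.<Name>` key pattern described on the linked configuration page. The helper names `basic_auth_header` and `with_request_headers`, the header `X-Example-Token`, and its value are hypothetical and only for illustration. Whether any static header is enough to satisfy an AWS-hosted cluster is not something this thread establishes.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of a Basic HTTP Authorization header.

    This mirrors the scheme described above: "user:password" is
    Base64-encoded and prefixed with "Basic ".
    """
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def with_request_headers(es_conf, headers):
    """Return a copy of es_conf with one connector setting per extra header.

    Assumes the es.net.http.header.<Name> key pattern from the
    elasticsearch-hadoop configuration page; the input dict is not modified.
    """
    conf = dict(es_conf)
    for name, value in headers.items():
        conf["es.net.http.header." + name] = value
    return conf


# Base settings taken from the script earlier in the thread.
base_conf = {
    "es.nodes": "node",
    "es.port": "443",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true",
}

# "X-Example-Token" is a hypothetical header name, not a real AWS header.
es_read_conf = with_request_headers(
    base_conf, {"X-Example-Token": "hypothetical-value"}
)
```

The resulting `es_read_conf` dict can be passed as the `conf` argument of `sc.newAPIHadoopRDD` in the same way as in the original script.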

system · January 9, 2019, 7:48pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
