Pyspark - read data from elasticsearch cluster on EMR

(Ruxiz) #1

I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is below:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)
es_read_conf = {
    "": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    #"es.query": '{ "query": { "match_all": {} } }',
    #"es.input.json": "true",
    "es.net.http.auth.user": "aws_access_key",
    "es.net.http.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)

Pyspark keeps throwing error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.

: [HEAD] on [index] failed; server[node:443] returned [403|Forbidden:]

(Ruxiz) #2

I checked everything, and it all made sense except for the user and pass entries — would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Please advise! Thanks!

(Ruxiz) #3

It looks like this client just does not support an AWS connector, and I need a custom AWS connector to connect to the AWS EMR cluster. Do you have any suggestions on how I should do it?
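(One likely explanation for the 403, assuming the EMR-hosted endpoint expects AWS-signed requests: AWS authenticates with Signature Version 4 request signing rather than Basic auth, so an access key and secret key sent as a plain user/password pair would be rejected. A minimal sketch of the SigV4 signing-key derivation — all names and values below are illustrative, not taken from this thread:)

```python
import hashlib
import hmac


def sigv4_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive an AWS Signature Version 4 signing key via chained HMAC-SHA256."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date_stamp.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()


# Illustrative placeholder inputs:
key = sigv4_signing_key("aws_secret_key", "20190101", "us-east-1", "es")
print(key.hex())
```

The derived key is then used to sign each individual request, which is why the connector's static Basic-auth header cannot stand in for it; people typically put a signing proxy in front of the cluster instead.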

(James Baiera) #4

I am not too familiar with integrating with technologies in AWS, since we do not test against those environments at all in the connector. I will say that the es.net.http.auth.user and es.net.http.auth.pass configurations are packaged together as a Basic HTTP authentication header in Base64 format. If you need alternative headers to be sent, I would specify them by setting them in the configuration.
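(To illustrate the above: the sketch below shows what the two auth settings effectively turn into, and how a custom header could be set — the es.net.http.header.* prefix is assumed here from the connector's configuration options, and the header name and values are placeholders:)

```python
import base64

# Illustrative elasticsearch-hadoop settings; values are placeholders.
es_read_conf = {
    "es.net.http.auth.user": "aws_access_key",
    "es.net.http.auth.pass": "aws_secret_key",
    # Custom headers can be injected via the es.net.http.header.* prefix
    # (assuming a connector version that supports it):
    "es.net.http.header.X-Custom-Header": "some-value",
}

# What the connector effectively sends for Basic auth:
user = es_read_conf["es.net.http.auth.user"]
password = es_read_conf["es.net.http.auth.pass"]
token = base64.b64encode(f"{user}:{password}".encode()).decode()
authorization = "Basic " + token
print(authorization)
```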

(system) closed #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.