I am trying to read data from Elasticsearch in PySpark using the elasticsearch-hadoop API. The ES cluster sits on AWS EMR and requires credentials to sign in. My script is below:
from pyspark import SparkContext, SparkConf

# Stop any existing context (e.g. the one the PySpark shell creates) before making a new one.
sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

# Connection and authentication settings passed to elasticsearch-hadoop.
es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    #"es.query": '{ "query": { "match_all": {} } }',
    #"es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

# Read the index as an RDD of (key, document) pairs.
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index] failed; server[node:443] returned [403|Forbidden:]