PySpark 3.3 with elasticsearch-spark-30_2.12-8.4.1.jar
(I've also tried elasticsearch-spark-30_2.12-7.17.6.jar with the same result.)
To rule out access problems: I can successfully read the index with both curl and the Python elasticsearch-py package using basic auth, with the client set up like:
client = Elasticsearch("https://test-elk.test.com:443/es/",basic_auth=(user, secret))
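For clarity, here is how that single elasticsearch-py URL decomposes into the separate connector options I pass to the Spark reader below (a small sketch I wrote to double-check the mapping; the option keys are the real es-hadoop ones, the parsing helper is my own):

```python
from urllib.parse import urlsplit

def to_connector_options(url: str) -> dict:
    """Split a single Elasticsearch URL into the es-hadoop option values."""
    parts = urlsplit(url)
    return {
        "es.nodes": f"{parts.scheme}://{parts.hostname}",
        "es.port": str(parts.port or (443 if parts.scheme == "https" else 9200)),
        "es.nodes.path.prefix": parts.path.rstrip("/"),
    }

# to_connector_options("https://test-elk.test.com:443/es/")
# -> {"es.nodes": "https://test-elk.test.com",
#     "es.port": "443",
#     "es.nodes.path.prefix": "/es"}
```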
The failure occurs only when querying the index from PySpark against the remote Elasticsearch nodes. Note that I have read-only access to the index. I do set "es.nodes.path.prefix" in the PySpark reader:
es_reader = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "https://test-elk.test.com")
    .option("es.nodes.path.prefix", "/es")
    .option("es.port", "443")
    .option("es.resource.read", "logz-test-app")
    .option("es.net.http.auth.user", user)
    .option("es.net.http.auth.pass", secret)
    .option("es.query", q2)
)
df = es_reader.load()  # the failure below occurs when the read is triggered
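For reference, q2 is the query DSL JSON handed to es.query; a minimal placeholder (match_all shown here — the real query isn't relevant to the failure) would be:

```python
import json

# Placeholder for the actual q2; es.query accepts a query DSL string like this.
q2 = json.dumps({"query": {"match_all": {}}})
```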
I get a good initial response in the debug output:
header: >> "GET /es/ HTTP/1.1[\r][\n]"
HttpMethodBase: Adding Host request header
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: >> "Content-Type: application/json[\r][\n]"
header: >> "Accept: application/json[\r][\n]"
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]"
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]"
header: >> "Host: test-elk.test.com[\r][\n]"
header: >> "[\r][\n]"
header: << "HTTP/1.1 200 OK[\r][\n]"
header: << "X-Opaque-Id: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: << "X-elastic-product: Elasticsearch[\r][\n]"
header: << "content-type: application/json; charset=UTF-8[\r][\n]"
header: << "content-length: 541[\r][\n]"
header: << "Strict-Transport-Security: max-age=15768000[\r][\n]"
content: << "{[\n]"
content: << " "name" : "es-client-3",[\n]"
content: << " "cluster_name" : "test-elk",[\n]"
content: << " "cluster_uuid" : "******-BcunQ",[\n]"
content: << " "version" : {[\n]"
content: << " "number" : "7.17.4",[\n]"
content: << " "build_flavor" : "default",[\n]"
content: << " "build_type" : "tar",[\n]"
content: << " "build_hash" : "*********",[\n]"
content: << " "build_date" : "2022-05-18T18:04:20.964345128Z",[\n]"
content: << " "build_snapshot" : false,[\n]"
content: << " "lucene_version" : "8.11.1",[\n]"
content: << " "minimum_wire_compatibility_version" : "6.8.0",[\n]"
content: << " "minimum_index_compatibility_version" : "6.0.0-beta1"[\n]"
content: << " },[\n]"
content: << " "tagline" : "You Know, for Search"[\n]"
content: << "}[\n]"
ElasticsearchRelation: Discovered Elasticsearch cluster [test-elk/04BInhHUQGmIbtXG-BcunQ], version [7.17.4]
The next request is a HEAD with the '/es' prefix applied. I'm not sure whether the prefix breaks that request, or whether the connector simply needs permissions my read-only account lacks:
HttpMethodDirector: Authenticating with BASIC <any realm>@xtest-elk.test:443
HttpConnection: Open connection to test-test.net:443
header: >> "HEAD /es/logz-test-app HTTP/1.1[\r][\n]"
HttpMethodBase: Adding Host request header
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037816089][\r][\n]"
header: >> "Content-Type: application/json[\r][\n]"
header: >> "Accept: application/json[\r][\n]"
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]"
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]"
header: >> "Host: test-elk.test.com[\r][\n]"
22/09/24 16:43:42 DEBUG header: >> "[\r][\n]"
header: << "HTTP/1.1 403 Forbidden[\r][\n]"
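To try to separate a URL problem from a privilege problem, I replayed that HEAD outside of Spark with the standard library (a sketch with the placeholder host/index from the logs above; this is my own code, not the connector's):

```python
import base64
import http.client

def es_head_path(prefix: str, resource: str) -> str:
    """Build the path the connector requests: {prefix}/{resource}."""
    return f"{prefix.rstrip('/')}/{resource.lstrip('/')}"

def head_index(host: str, prefix: str, index: str, user: str, secret: str) -> int:
    """Replay the connector's HEAD {prefix}/{index} with the same basic auth."""
    token = base64.b64encode(f"{user}:{secret}".encode()).decode()
    conn = http.client.HTTPSConnection(host, 443)
    conn.request("HEAD", es_head_path(prefix, index),
                 headers={"Authorization": f"Basic {token}"})
    status = conn.getresponse().status
    conn.close()
    return status

# es_head_path("/es", "logz-test-app") -> "/es/logz-test-app", which matches
# the HEAD line in the debug log, so the prefix is applied as configured.
# If this also returns 403, the problem is index privileges, not the URL.
```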
Does adding the prefix produce a malformed HEAD request, or does the connector require more than read-only permissions on the index?
Any help is greatly appreciated.