403 Forbidden when using '"es.nodes.path.prefix" pyspark 8.4.1 pyspark 3.3

PySpark 3.3 with elasticsearch-spark-30_2.12-8.4.1.jar

(I've also tried elasticsearch-spark-30_2.12-7.17.6.jar with the same result.)

I have access to the index and have validated this via curl and the standard Python elasticsearch package.

Just to clarify: I have successfully tested reading the index with curl and with the Python elasticsearch-py package using the basic auth approach, with the URL set up like:

client = Elasticsearch("https://test-elk.test.com:443/es/", basic_auth=(user, secret))

The failure occurs when attempting to query the index from PySpark against the remote Elasticsearch nodes.
I should note that I have only read-only access to the index.
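
For reference, the out-of-Spark check that succeeds looks roughly like this (a minimal sketch against elasticsearch-py 8.x; user and secret are the same credentials used in the Spark code below):

# Sketch: verify read access outside of Spark with elasticsearch-py.
# The /es prefix is simply part of the client URL here.
from elasticsearch import Elasticsearch

client = Elasticsearch("https://test-elk.test.com:443/es/", basic_auth=(user, secret))
resp = client.search(index="logz-test-app", query={"match_all": {}})
print(resp["hits"]["total"])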

I do use "es.nodes.path.prefix" in the PySpark implementation:

q2 = '{ "query": { "match_all": {} }}'

es_reader = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "https://test-elk.test.com")
    .option("es.nodes.path.prefix", "/es")
    .option("es.port", 443)
    .option("es.resource.read", "logz-test-app")
    .option("es.net.http.auth.user", user)
    .option("es.net.http.auth.pass", secret)
    .option("es.query", q2)
)

df = es_reader.load()

I get a good initial response in the debug output:

header: >> "GET /es/ HTTP/1.1[\r][\n]"
HttpMethodBase: Adding Host request header
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: >> "Content-Type: application/json[\r][\n]"
header: >> "Accept: application/json[\r][\n]"
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]"
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]"
header: >> "Host: test-elk.test.com[\r][\n]"
header: >> "[\r][\n]"
header: << "HTTP/1.1 200 OK[\r][\n]"
header: << "X-Opaque-Id: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: << "X-elastic-product: Elasticsearch[\r][\n]"
header: << "content-type: application/json; charset=UTF-8[\r][\n]"
header: << "content-length: 541[\r][\n]"
header: << "Strict-Transport-Security: max-age=15768000[\r][\n]"
content: << "{[\n]"
content: << "  "name" : "es-client-3",[\n]"
content: << "  "cluster_name" : "test-elk",[\n]"
content: << "  "cluster_uuid" : "******-BcunQ",[\n]"
content: << "  "version" : {[\n]"
content: << "    "number" : "7.17.4",[\n]"
content: << "    "build_flavor" : "default",[\n]"
content: << "    "build_type" : "tar",[\n]"
content: << "    "build_hash" : "*********",[\n]"
content: << "    "build_date" : "2022-05-18T18:04:20.964345128Z",[\n]"
content: << "    "build_snapshot" : false,[\n]"
content: << "    "lucene_version" : "8.11.1",[\n]"
content: << "    "minimum_wire_compatibility_version" : "6.8.0",[\n]"
content: << "    "minimum_index_compatibility_version" : "6.0.0-beta1"[\n]"
content: << "  },[\n]"
content: << "  "tagline" : "You Know, for Search"[\n]"
content: << "}[\n]"
ElasticsearchRelation: Discovered Elasticsearch cluster [test-elk/******-BcunQ], version [7.17.4]

The next request adds the '/es' prefix to a HEAD request, and I'm not sure whether this causes the failure, or whether it is simply an issue with the permissions PySpark needs:

HttpMethodDirector: Authenticating with BASIC <any realm>@test-elk.test.com:443
HttpConnection: Open connection to test-elk.test.com:443
header: >> "HEAD /es/logz-test-app HTTP/1.1[\r][\n]" 
HttpMethodBase: Adding Host request header 
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037816089][\r][\n]" 
header: >> "Content-Type: application/json[\r][\n]" 
header: >> "Accept: application/json[\r][\n]" 
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]" 
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]" 
header: >> "Host: test-elk.test.com[\r][\n]" 22/09/24 16:43:42 DEBUG header: >> "[\r][\n]"
header: << "HTTP/1.1 403 Forbidden[\r][\n]"

Does the addition of the prefix cause a malformed HEAD request, or does the implementation require more than read-only permissions?
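
If it helps isolate things, the same HEAD probe can be replayed outside of Spark (a sketch with the requests library, reusing the host, path, and credentials from above):

# Sketch: replay the connector's HEAD /es/logz-test-app probe.
# Note: requests.head() does not follow redirects by default.
import requests

resp = requests.head(
    "https://test-elk.test.com:443/es/logz-test-app",
    auth=(user, secret),  # same credentials as the Spark options above
    headers={"Accept": "application/json"},
)
print(resp.status_code)  # 403 here would mean the failure is not Spark-specific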

Any help is greatly appreciated

Hi @rbigley Welcome to the Community!

I know nothing about pyspark... but I can tell you that is not a valid path to an index in Elasticsearch.

es.nodes.path.prefix (default empty)
Prefix to add to all requests made to Elasticsearch. Useful in environments where the cluster is proxied/routed under a certain path. For example, if the cluster is located at someaddress:someport/custom/path/prefix, one would set es.nodes.path.prefix to /custom/path/prefix.

So unless you are using a proxy or something that is routing/rewriting the URL, I would take it out and try without it. If you are, just ignore me...

Could you show the curl you used to access your index?

And what did this return?

client = Elasticsearch("https://test-elk.test.com:443/es/", basic_auth=(user, secret))

403 is an auth issue... then we could check the permissions. Can you access that resource outside of pyspark, like directly through curl?

Because you may be able to hit the root /, which gives the Elasticsearch info, while the index could require authorization... I would debug it in curl first, then pyspark.

Thanks for the input. The elasticsearch-spark jar controls the request format, and I'm wondering if there is a bug, or if the Elasticsearch instance I'm connecting to is non-standard and the jar can't handle it.

I've looked at the Python elasticsearch-py log output, and it omits the /es prefix and only includes the index.
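
For reference, that logging can be enabled like this (a sketch; the logger name comes from the 8.x elastic_transport package that elasticsearch-py uses):

# Sketch: log each request's method, URL, and status code.
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("elastic_transport.transport").setLevel(logging.INFO)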

So back to my base question: are you using a proxy?

What does this curl return?

curl -u user:pw https://test-elk.test.com:443/es/logz-test-app

No proxy.

curl -XGET 'https://test-elk.test.com:443/es/logz-test-app/' --user api_svc_acct:pw

the return:
{"error":{"root_cause":[{"type":"security_exception",
"reason":"action [indices:admin/get] 
is unauthorized for user [api_svc_acct] with roles [api_service_account], 
this action is granted by the index privileges [view_index_metadata,manage,all]"}],
"type":"security_exception",
"reason":"action [indices:admin/get] is unauthorized for user 
[api_svc_acct] with roles [api_service_account], 
this action is granted by the index privileges [view_index_metadata,manage,all]"

If I change the URL and add _search, I get a result:

curl -XGET 'https://test-elk.test.com:443/es/logz-test-app/_search' --user api_svc_acct:pw

So without the _search, that URL accesses the mapping (schema).

So it looks like your user can read/search that index, but not the schema or mapping.
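
You can see that split directly with the Python client you already have working (a sketch; search exercises the read privilege, while indices.get is the call behind that curl and maps to indices:admin/get):

# Sketch: these two calls exercise different index privileges.
client.search(index="logz-test-app", query={"match_all": {}})  # works: needs only "read"
client.indices.get(index="logz-test-app")  # 403: needs view_index_metadata, manage, or all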

So that looks like the same behavior you are seeing in your original post.

Not sure if pyspark requires the ability to read the mapping.
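
If it does, the fix on the Elasticsearch side would be for your admins to add view_index_metadata to your role, along these lines (a sketch with elasticsearch-py; admin_client is assumed to be a client with sufficient rights, and the role/index names just mirror your error message):

# Sketch: grant the metadata privilege alongside read (run by an admin).
admin_client.security.put_role(
    name="api_service_account",
    indices=[{
        "names": ["logz-test-app"],
        "privileges": ["read", "view_index_metadata"],
    }],
)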

BTW, not that it really matters...

But a path like this with the /es usually indicates a proxy of some sort... which I do not think is the source of your issue.

@rbigley could you post the pyspark code and es mappings needed to reproduce your problem, as well as any relevant logs from your spark driver or executors? I'm not sure that we've ever tested with a URL like that.
On a side note, you might want to check the Scala version that's part of your Spark installation if you haven't already. You're using the es-spark jar that works with Scala 2.12, and it won't work with Scala 2.13. If you're using Scala 2.13, grab elasticsearch-spark-30_2.13-8.4.1.jar.
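
If it helps, here is one quick way to check the Scala version from a running PySpark session (a sketch that reaches into the JVM via the py4j gateway):

# Sketch: print the JVM's Scala version from PySpark via py4j.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# e.g. "version 2.12.15" -> use the _2.12 connector jar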

Addressing the side note first: I have checked the Scala version and made sure it matches the elasticsearch-spark jar. (My first attempts did have the wrong jar, which resulted in an error saying it couldn't find the Scala method/library.)


q2 = '{ "query": { "match_all": {} }}'

es_reader = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("inferSchema", "true")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "https://test-elk.test.com")
    .option("es.nodes.path.prefix", "/es")
    .option("es.port", 443)
    .option("es.resource.read", "logz-test-app")
    .option("es.net.http.auth.user", user)
    .option("es.net.http.auth.pass", secret)
    .option("es.query", q2)
)

result = es_reader.load()

I did get a chance to talk with the Elasticsearch administrators, and they said the '/es' does a redirect on the backend to the https://test-elk.test.com/es/logz-test-app/_search URL. They are going to work on setting up access for me that doesn't redirect.

