PySpark 3.3 with elasticsearch-spark-30_2.12-8.4.1.jar
(I've also tried elasticsearch-spark-30_2.12-7.17.6.jar with the same result.)
To rule out access problems: I can successfully read the index with both curl and the Python elasticsearch-py package using basic auth, with the client set up like:
client = Elasticsearch("https://test-elk.test.com:443/es/",basic_auth=(user, secret))
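For clarity, here is how that single elasticsearch-py URL decomposes into the separate connector options I pass to the Spark reader below (a small sketch I wrote to double-check the mapping; the option keys are the real es-hadoop ones, the parsing helper is my own):

```python
from urllib.parse import urlsplit

def to_connector_options(url: str) -> dict:
    """Split a single Elasticsearch URL into the es-hadoop option values."""
    parts = urlsplit(url)
    return {
        "es.nodes": f"{parts.scheme}://{parts.hostname}",
        "es.port": str(parts.port or (443 if parts.scheme == "https" else 9200)),
        "es.nodes.path.prefix": parts.path.rstrip("/"),
    }

# to_connector_options("https://test-elk.test.com:443/es/")
# -> {"es.nodes": "https://test-elk.test.com",
#     "es.port": "443",
#     "es.nodes.path.prefix": "/es"}
```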
The failure occurs only when querying the index from PySpark against the remote Elasticsearch nodes. Note that I have read-only access to the index. I do set "es.nodes.path.prefix" in the PySpark reader:
es_reader = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "https://test-elk.test.com")
    .option("es.nodes.path.prefix", "/es")
    .option("es.port", "443")
    .option("es.resource.read", "logz-test-app")
    .option("es.net.http.auth.user", user)
    .option("es.net.http.auth.pass", secret)
    .option("es.query", q2)
)
df = es_reader.load()  # the failure below occurs when the read is triggered
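For reference, q2 is the query DSL JSON handed to es.query; a minimal placeholder (match_all shown here — the real query isn't relevant to the failure) would be:

```python
import json

# Placeholder for the actual q2; es.query accepts a query DSL string like this.
q2 = json.dumps({"query": {"match_all": {}}})
```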
I get a good initial response in the debug output:
header: >> "GET /es/ HTTP/1.1[\r][\n]"
HttpMethodBase: Adding Host request header
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: >> "Content-Type: application/json[\r][\n]"
header: >> "Accept: application/json[\r][\n]"
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]"
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]"
header: >> "Host: test-elk.test.com[\r][\n]"
header: >> "[\r][\n]"
header: << "HTTP/1.1 200 OK[\r][\n]"
header: << "X-Opaque-Id: [spark] [root] [ES-Test] [local-1664037542900][\r][\n]"
header: << "X-elastic-product: Elasticsearch[\r][\n]"
header: << "content-type: application/json; charset=UTF-8[\r][\n]"
header: << "content-length: 541[\r][\n]"
header: << "Strict-Transport-Security: max-age=15768000[\r][\n]"
content: << "{[\n]"
content: << " "name" : "es-client-3",[\n]"
content: << " "cluster_name" : "test-elk",[\n]"
content: << " "cluster_uuid" : "******-BcunQ",[\n]"
content: << " "version" : {[\n]"
content: << " "number" : "7.17.4",[\n]"
content: << " "build_flavor" : "default",[\n]"
content: << " "build_type" : "tar",[\n]"
content: << " "build_hash" : "*********",[\n]"
content: << " "build_date" : "2022-05-18T18:04:20.964345128Z",[\n]"
content: << " "build_snapshot" : false,[\n]"
content: << " "lucene_version" : "8.11.1",[\n]"
content: << " "minimum_wire_compatibility_version" : "6.8.0",[\n]"
content: << " "minimum_index_compatibility_version" : "6.0.0-beta1"[\n]"
content: << " },[\n]"
content: << " "tagline" : "You Know, for Search"[\n]"
content: << "}[\n]"
ElasticsearchRelation: Discovered Elasticsearch cluster [test-elk/04BInhHUQGmIbtXG-BcunQ], version [7.17.4]
The next request is a HEAD with the '/es' prefix applied. I'm not sure whether the prefix breaks that request, or whether the connector simply needs permissions my read-only account lacks:
HttpMethodDirector: Authenticating with BASIC <any realm>@xtest-elk.test:443
HttpConnection: Open connection to test-test.net:443
header: >> "HEAD /es/logz-test-app HTTP/1.1[\r][\n]"
HttpMethodBase: Adding Host request header
header: >> "X-Opaque-ID: [spark] [root] [ES-Test] [local-1664037816089][\r][\n]"
header: >> "Content-Type: application/json[\r][\n]"
header: >> "Accept: application/json[\r][\n]"
header: >> "Authorization: Basic ****ZpY2VfYWNjb3VudDpu********[\r][\n]"
header: >> "User-Agent: Jakarta Commons-HttpClient/3.0.1[\r][\n]"
header: >> "Host: test-elk.test.com[\r][\n]"
22/09/24 16:43:42 DEBUG header: >> "[\r][\n]"
header: << "HTTP/1.1 403 Forbidden[\r][\n]"
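To try to separate a URL problem from a privilege problem, I replayed that HEAD outside of Spark with the standard library (a sketch with the placeholder host/index from the logs above; this is my own code, not the connector's):

```python
import base64
import http.client

def es_head_path(prefix: str, resource: str) -> str:
    """Build the path the connector requests: {prefix}/{resource}."""
    return f"{prefix.rstrip('/')}/{resource.lstrip('/')}"

def head_index(host: str, prefix: str, index: str, user: str, secret: str) -> int:
    """Replay the connector's HEAD {prefix}/{index} with the same basic auth."""
    token = base64.b64encode(f"{user}:{secret}".encode()).decode()
    conn = http.client.HTTPSConnection(host, 443)
    conn.request("HEAD", es_head_path(prefix, index),
                 headers={"Authorization": f"Basic {token}"})
    status = conn.getresponse().status
    conn.close()
    return status

# es_head_path("/es", "logz-test-app") -> "/es/logz-test-app", which matches
# the HEAD line in the debug log, so the prefix is applied as configured.
# If this also returns 403, the problem is index privileges, not the URL.
```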
Does adding the prefix produce a malformed HEAD request, or does the connector require more than read-only permissions on the index?
Any help is greatly appreciated.