Elasticsearch and Spark fail with 403 (Forbidden)

Continuing the discussion from Basic Authentication with Spark fails with 403(forbidden):

Hi,
I may have the same issue. I noticed I get these log messages:

WARN HttpMethodDirector: Required credentials not available for BASIC <any realm>@localhost:8080
WARN HttpMethodDirector: Preemptive authentication requested but no default credentials available

Can you explain how to define these credentials, or do I have some other proxy issue?

SBT config:

scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0"

Spark Config:
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("Spark ElasticSearch")
      .setMaster("local[1]")
      // Elasticsearch endpoint
      .set("es.nodes", "localhost")
      .set("es.port", "8080")
      .set("es.nodes.discovery", "false")
      //.set("es.net.http.auth.user", "username")
      //.set("es.net.http.auth.pass", "password")
      // HTTP proxy and its credentials
      .set("es.net.proxy.http.host", "proxy.host.com.au")
      .set("es.net.proxy.http.port", "8080")
      .set("es.net.proxy.http.user", "username")
      .set("es.net.proxy.http.pass", "password")

Log:

 DefaultHttpParams: Set parameter http.useragent = Jakarta Commons-HttpClient/3.1
 DefaultHttpParams: Set parameter http.protocol.version = HTTP/1.1
 DefaultHttpParams: Set parameter http.connection-manager.class = class org.apache.commons.httpclient.SimpleHttpConnectionManager
 DefaultHttpParams: Set parameter http.protocol.cookie-policy = default
 DefaultHttpParams: Set parameter http.protocol.element-charset = US-ASCII
 DefaultHttpParams: Set parameter http.protocol.content-charset = ISO-8859-1
 DefaultHttpParams: Set parameter http.method.retry-handler = org.apache.commons.httpclient.DefaultHttpMethodRetryHandler@5c2fb6d8
 DefaultHttpParams: Set parameter http.method.retry-handler = org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport$1@48ded677
 DefaultHttpParams: Set parameter http.connection-manager.timeout = 60000
 DefaultHttpParams: Set parameter http.socket.timeout = 60000
 CommonsHttpTransport: Using authenticated HTTP proxy [proxy.host.com.au:8080]
 HttpClient: Java version: 1.8.0_66
 HttpClient: Java vendor: Oracle Corporation
 HttpClient: Java class path: myclasspath
 HttpClient: Operating system name: Windows 7
 HttpClient: Operating system architecture: amd64
 HttpClient: Operating system version: 6.1
 HttpClient: SUN 1.8: SUN (DSA key/parameter generation;...
 HttpClient: SunRsaSign 1.8: Sun RSA signature provider
 HttpClient: SunEC 1.8: Sun Elliptic Curve provider (EC, ECDSA, ECDH)
 HttpClient: SunJSSE 1.8: Sun JSSE provider(PKCS12..
 HttpClient: SunJCE 1.8: SunJCE Provider (implements ..
 HttpClient: SunJGSS 1.8: Sun (Kerberos v5, SPNEGO)
 HttpClient: SunSASL 1.8: Sun SASL provider(implements client mechanisms for: DIGEST-MD5, GSSAPI,..
 HttpClient: SunPCSC 1.8: Sun PC/SC provider
 HttpClient: SunMSCAPI 1.8: Sun's Microsoft Crypto API provider
 DefaultHttpParams: Set parameter http.authentication.preemptive = true
 DefaultHttpParams: Set parameter http.tcp.nodelay = true
 HttpMethodDirector: Preemptively sending default basic credentials
 HttpMethodDirector: Authenticating with BASIC <any realm>@proxy.host.com.au:8080
 HttpMethodParams: Credential charset not configured, using HTTP element charset
 HttpMethodDirector: Authenticating with BASIC <any realm>@localhost:8080
WARN HttpMethodDirector: Required credentials not available for BASIC <any realm>@localhost:8080
WARN HttpMethodDirector: Preemptive authentication requested but no default credentials available
 HttpConnection: Open connection to proxy.host.com.au:8080
 header: >> "GET http://localhost:8080/ HTTP/1.1[\r][\n]"
 HttpMethodBase: Adding Host request header
 header: >> "Proxy-Authorization: Basic KioqKioqKio6KioqKioqKio=[\r][\n]"
 header: >> "User-Agent: Jakarta Commons-HttpClient/3.1[\r][\n]"
 header: >> "Host: localhost:8080[\r][\n]"
 header: >> "Proxy-Connection: Keep-Alive[\r][\n]"
 header: >> "[\r][\n]"
 header: << "HTTP/1.1 403 Forbidden[\r][\n]"
 header: << "HTTP/1.1 403 Forbidden[\r][\n]"
 header: << "Content-Length: 0[\r][\n]"
 header: << "Date: Wed, 27 Jan 2016 04:27:13 GMT[\r][\n]"
 header: << "Via: HTTP/1.1 proxy10705, 1.1 proxy.host.com.au:3128 (Cisco-IronPort-WSA/7.5.1-201)[\r][\n]"
 header: << "Connection: keep-alive[\r][\n]"
 header: << "Proxy-Connection: keep-alive[\r][\n]"
 header: << "[\r][\n]"
 HttpMethodBase: Should NOT close connection in response to directive: keep-alive
 HttpConnection: Releasing connection back to connection manager.
ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[null] returned [403|Forbidden:]
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:336)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:301)

I have added minor formatting to your post to make it more readable. Please do so yourself in the future. Thanks!

@tintin-74 The logs are useful. Not sure what configuration you have used but the REST package should be configured through log4j.
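
For reference, a minimal sketch of raising the log level for that package from code rather than from a log4j.properties file. It assumes log4j 1.x is on the classpath (as it is with the Spark version listed above); the object name `EnableRestTrace` is made up for illustration, and the package name is taken from the stack trace in the post.

```scala
import org.apache.log4j.{Level, Logger}

// Hypothetical helper: call it once on the driver before the Spark job starts.
object EnableRestTrace {
  def apply(): Unit = {
    // Package name as it appears in the stack trace (org.elasticsearch.hadoop.rest).
    Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
  }
}
```

Calling `EnableRestTrace()` at the top of `main` should be enough in local mode; on a cluster the executors have their own logging configuration, so this is only a starting point.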

At first glance it looks like the credentials are sent but the proxy (it looks to be a Cisco IronPort Web Security Appliance, or WSA) ignores or discards them (maybe they are incorrect). I'm not familiar with that product so I don't have anything specific to suggest.
How do you typically connect through it to other resources? An HTTP browser example is ideal since this is what ES-Hadoop acts like (REST calls).

Cheers,

In my web browser I just define the proxy server (address and port) in the Windows LAN Settings dialog. I don't define a username and password, as it uses the credentials I'm logged into the machine with.

I set the following VM parameters in IntelliJ IDEA:

-Dhttp.proxyHost=proxy.host.com.au 
-Dhttp.proxyPort=8080 
-Dhttps.proxyHost=proxy.host.com.au 
-Dhttps.proxyPort=8080 
-Dhttps.proxyUser=username 
-Dhttps.proxyPassword=password

Log output when using TRACE on the rest package:

16/01/28 16:58:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/01/28 16:58:32 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2267 bytes)
16/01/28 16:58:32 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/01/28 16:58:32 DEBUG CommonsHttpTransport: Using HTTP proxy [proxy.host.com.au:8080]
16/01/28 16:58:32 INFO CommonsHttpTransport: Using detected HTTP Auth credentials...
16/01/28 16:58:32 TRACE CommonsHttpTransport: Opening HTTP transport to localhost:8080
16/01/28 16:58:32 TRACE CommonsHttpTransport: Tx [HTTP proxy proxy.host.com.au:8080][GET]@[localhost:8080] w/ payload [null]
16/01/28 16:58:32 WARN HttpMethodDirector: Required proxy credentials not available for BASIC @proxy.host.com.au:8080
16/01/28 16:58:32 WARN HttpMethodDirector: Preemptive authentication requested but no default proxy credentials available
16/01/28 16:58:32 TRACE CommonsHttpTransport: Rx [HTTP proxy proxy.host.com.au:8080]@[ipaddress] [403-Forbidden]
16/01/28 16:58:32 TRACE CommonsHttpTransport: Closing HTTP transport to localhost:8080
16/01/28 16:58:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on failed; server[null] returned [403|Forbidden:]
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:336)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:301)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:305)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:119)

It looks like you need to set up a proxy for both http and https. ES-Hadoop doesn't configure https yet (this should be added), but you also don't seem to be using it.

Two things spring to mind:

  1. Use 2.2.0-rc1, which contains several fixes on this front.
  2. Pass -Djava.net.useSystemProxies=true to your Spark job so that the created JVM picks up the proxy settings installed on your system.
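
To check what the second suggestion changes, here is a small sketch that asks the JVM which proxies it would try for a given URL. It uses only the JDK's `java.net.ProxySelector`; the object and method names (`SystemProxyCheck`, `proxiesFor`) are made up for illustration. It shows what the JVM itself resolves, which is not necessarily what the HTTP client inside ES-Hadoop ends up using.

```scala
import java.net.{Proxy, ProxySelector, URI}
import scala.collection.JavaConverters._

object SystemProxyCheck {
  // Proxies the JVM's default selector would try for this URL, in order.
  // A direct connection shows up as a single Proxy of type DIRECT.
  def proxiesFor(url: String): List[Proxy] =
    ProxySelector.getDefault.select(new URI(url)).asScala.toList

  def main(args: Array[String]): Unit = {
    // Same effect as -Djava.net.useSystemProxies=true, but the JVM checks this
    // property only once, so the -D flag on the command line is the safer route.
    System.setProperty("java.net.useSystemProxies", "true")
    // URL built from the es.nodes / es.port values used in this thread.
    proxiesFor("http://localhost:8080/").foreach(p => println(p))
  }
}
```

Running it once with and once without the flag gives a quick comparison of the two setups.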

I think I found the issue: you have a misconfiguration, as both the ES host and the proxy point to the same address, which is incorrect.
The proxy is an intermediary that routes the connection to your ES nodes/host; it's just a layer of indirection.
If the proxy is the host, there's no need to use a proxy in the first place.

Looking at the logs, it looks like it's the host that requires the user/pass, not the proxy. Maybe the confusion comes from the fact that, again, proxy == host.
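
As a rough illustration of keeping the two apart, the sketch below lists the settings with the Elasticsearch endpoint and the proxy as separate machines, plus a small check that flags the proxy == host case. The setting keys are the ones from the original post; every host name, port and credential is a placeholder, and `EsProxySettings` / `proxyEqualsNode` are made-up names. The check also assumes a single host in es.nodes.

```scala
// Sketch only: all values below are placeholders, not a known-good configuration.
object EsProxySettings {
  val settings: Map[String, String] = Map(
    "es.nodes"               -> "es.example.internal", // hypothetical Elasticsearch host
    "es.port"                -> "9200",                // hypothetical Elasticsearch HTTP port
    "es.nodes.discovery"     -> "false",
    "es.net.http.auth.user"  -> "es-username",         // credentials for Elasticsearch itself
    "es.net.http.auth.pass"  -> "es-password",
    "es.net.proxy.http.host" -> "proxy.host.com.au",   // the proxy is a different machine
    "es.net.proxy.http.port" -> "8080",
    "es.net.proxy.http.user" -> "proxy-username",      // credentials for the proxy
    "es.net.proxy.http.pass" -> "proxy-password"
  )

  // True when the proxy and the Elasticsearch endpoint resolve to the same host:port pair.
  def proxyEqualsNode(s: Map[String, String]): Boolean = {
    val node  = (s.get("es.nodes"), s.get("es.port"))
    val proxy = (s.get("es.net.proxy.http.host"), s.get("es.net.proxy.http.port"))
    node._1.isDefined && node == proxy
  }
}
```

The map can be applied in one go with `new SparkConf().setAll(EsProxySettings.settings)`, and `proxyEqualsNode` can be run against it beforehand as a sanity check.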