Spark job failing with "Failure authenticating with BASIC" error

(diplomaticguru) #1

I have an Elasticsearch cluster with Basic HTTP authentication enabled. In my Spark configuration, I set the following parameters as described in the documentation:
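
The parameters themselves did not survive in this post. As a rough sketch only, assuming they were the Basic-auth settings documented for es-hadoop (`es.net.http.auth.user` and `es.net.http.auth.pass`), the snippet below builds the matching `spark-submit` flags. The host name is a placeholder, `es_auth_conf` and `to_spark_submit_args` are made-up helper names, and the `spark.` prefix assumes the connector version in use picks up `spark.es.*` properties.

```python
# Sketch: es-hadoop Basic-auth settings rendered as spark-submit flags.
# Assumption: the property names are the ones from the es-hadoop
# configuration docs; host and credentials below are placeholders.


def es_auth_conf(nodes, user, password):
    """Return the connector settings needed for HTTP Basic auth."""
    return {
        "es.nodes": nodes,
        "es.net.http.auth.user": user,
        "es.net.http.auth.pass": password,
    }


def to_spark_submit_args(conf):
    """Render settings as --conf flags, prefixed with 'spark.' so that
    spark-submit forwards them to the job (assumed connector behaviour)."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "spark.{}={}".format(key, conf[key])]
    return args


if __name__ == "__main__":
    conf = es_auth_conf("es-host.example:9200", "testes", "test123")
    print(" ".join(to_spark_submit_args(conf)))
```

Note that these settings only supply credentials; they do not change which endpoints the connector calls.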


However, when I run the Spark job on my yarn-cluster, I get this error:

httpclient.HttpMethodDirector: Failure authenticating with BASIC 'Elasticsearch cluster read/write'
15/07/16 17:22:23 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 25) [GET] on [_nodes/transport] failed; server[null] returned [401|Unauthorized:]
    at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEsWithMeta$1.apply(EsSpark.scala:86)
    at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEsWithMeta$1.apply(EsSpark.scala:86)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

But when I use curl to check the index with the same user/pass, it works fine:
curl -u testes:test123 -XGET

Please let me know what I am doing incorrectly.

(diplomaticguru) #2

Okay, so I did my own investigation and found the problem, but I still need your help to resolve it.

When I checked the es-hadoop source code, I found that the error is thrown when the discoverNodes() method in the RestClient class is called. This method tries to GET node details from the "_nodes/transport" endpoint. However, the user testes does not have admin privileges, so the request is rejected with that error.

I tried to get the "_nodes/transport" details using curl, and it failed with the error below, as expected:

 curl -u testes:test123 -XGET
<title>401 Authorization Required</title>
<h1>Authorization Required</h1>
<p>This server could not verify that you
are authorized to access the document
requested.  Either you supplied the wrong
credentials (e.g., bad password), or your
browser doesn't understand how to supply
the credentials required.</p>

The user account that I'm using only has access to a specific index (we don't want it to access everything), so it cannot access "_nodes/transport". I'm not sure what I could do other than grant admin privileges to the account, which I don't want to do. Any suggestions?
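
For anyone wanting to repeat the two curl checks programmatically, here is a minimal sketch using only the Python standard library. The base URL and index name are placeholders rather than values from this thread, and `basic_auth_header`, `build_request`, and `status_of` are illustrative helper names.

```python
# Sketch: repeat the two curl checks with Basic auth and compare status codes.
# Assumption: the base URL and index name passed in by the caller are
# placeholders; none of them come from this thread.
import base64
import urllib.error
import urllib.request


def basic_auth_header(user, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(base_url, path, user, password):
    """Create a GET request for base_url/path carrying Basic credentials."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", basic_auth_header(user, password))
    return req


def status_of(req):
    """Send the request and return the HTTP status code (e.g. 200 or 401)."""
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report their status code.
        return err.code
```

Calling `status_of(build_request(base, "my-index/_count", user, password))` and then the same with `"_nodes/transport"` should show whether the account is accepted for the index but rejected for the node listing, which is the pattern described above.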

(Costin Leau) #3

What system are you using for securing the cluster? The connector needs information about the index topology in order to access the nodes/shards directly - without getting access to the nodes, it cannot do any discovery (even when using the client-only option).

(diplomaticguru) #4

@costin, it's our own custom solution using an Apache proxy/redirector to provide per-index authentication. When we access the search node, Apache first authenticates the request and then forwards it to ES. With this authentication in place, I don't think the current es-hadoop library will work as expected unless it is customised!
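
A toy model of per-index authorization shows why the discovery call is rejected under such a setup. This is not the actual Apache configuration from this thread: the ACL table, the index name `my-index`, the helper names, and the assumption that access is decided by the first path segment are all hypothetical.

```python
# Toy model of a proxy that authorizes requests per index.
# Assumption: access is decided by the first path segment; the ACL
# contents and the index name are hypothetical.

ACL = {
    "testes": {"my-index"},
}


def first_segment(path):
    """Return the first non-empty path segment ('_nodes' for '/_nodes/transport')."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else ""


def is_allowed(user, path, acl=ACL):
    """True if the user may access the resource named by the first path segment."""
    return first_segment(path) in acl.get(user, set())


def grant(user, resource, acl=ACL):
    """Return a copy of the ACL with one more resource granted to the user."""
    updated = {u: set(r) for u, r in acl.items()}
    updated.setdefault(user, set()).add(resource)
    return updated


if __name__ == "__main__":
    print(is_allowed("testes", "/my-index/_search"))   # index request
    print(is_allowed("testes", "/_nodes/transport"))   # discovery request
    widened = grant("testes", "_nodes")
    print(is_allowed("testes", "/_nodes/transport", widened))
```

In this model, requests under `my-index` pass while `/_nodes/transport` is refused until `_nodes` is granted explicitly, without giving the account access to other indices.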

(diplomaticguru) #5

@costin, don't worry about this issue; we've granted our user access to _nodes.

Many thanks.

(Costin Leau) #6

Glad to hear things were sorted out.
