Unable to resolve IP for complex ES hostname

I have an Elasticsearch URL with embedded username/password that looks something like this:

https://USERNAME:PASSWORD@ADDRESS/some/path/to/elasticsearch

Of course, I have no issues submitting queries to this address with curl (over the default HTTPS port, 443). However, when I use elasticsearch-hadoop with PySpark to fetch data from the same URL, it appears to resolve the hostname to just USERNAME.
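
For what it's worth, Python's own URL parser splits this URL the way I'd expect, with the credentials kept separate from the host, which makes me suspect the connector is instead treating everything before the first colon as the hostname. A quick illustration (placeholders stand in for the real values):

from urllib.parse import urlparse

parsed = urlparse("https://USERNAME:PASSWORD@ADDRESS/some/path/to/elasticsearch")

print(parsed.hostname)  # the host alone (urlparse lowercases it)
print(parsed.username)  # "USERNAME" -- credentials, not part of the host
print(parsed.path)      # "/some/path/to/elasticsearch"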

Steps to reproduce

Code:

import os
import pyspark
from pyspark.sql import SparkSession

# Make the es-hadoop connector jar available before the session is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar pyspark-shell'

spark = SparkSession.builder.getOrCreate()

es_read_conf = {
    "es.nodes" : "https://USERNAME:PASSWORD@ADDRESS/some/path/to/elasticsearch
",
    "es.port" : "443",
    "es.resource" : "reviews/basic"
}

# This call triggers the connection attempt and fails with the error below
es_rdd = spark.sparkContext.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)

Stack trace:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot resolve ip for hostname: USERNAME
	at org.elasticsearch.hadoop.util.SettingsUtils.resolveHostToIpIfNecessary(SettingsUtils.java:84)
	at org.elasticsearch.hadoop.util.SettingsUtils.qualifyNodes(SettingsUtils.java:46)
	at org.elasticsearch.hadoop.util.SettingsUtils.declaredNodes(SettingsUtils.java:142)
	at org.elasticsearch.hadoop.util.SettingsUtils.discoveredOrDeclaredNodes(SettingsUtils.java:148)
	at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:64)
	at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:58)
	at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:101)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:327)
	at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
	at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:414)
	at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:395)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:239)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:302)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
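
The curl check above can be reproduced from Python as well, with the credentials passed as basic auth rather than embedded in the URL (a sketch; assumes the requests library is installed):

import requests

# Same endpoint as above, credentials moved out of the URL into basic auth
resp = requests.get(
    "https://ADDRESS/some/path/to/elasticsearch",
    auth=("USERNAME", "PASSWORD"),
)
print(resp.status_code)  # expect 200, matching the curl behaviour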

Version Info

OS: macOS Mojave 10.14.6
JVM:

openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b04, mixed mode)

Hadoop/Spark: Spark 2.4.4
ES-Hadoop: 7.6.0
ES: 6.8.1

Any help would be appreciated!

Edit

I've also tried this config:

es_read_conf = {
    "es.nodes" : "https://ADDRESS:443/path/to/elasticsearch",
    "es.resource" : "reviews/basic",
    "es.net.http.auth.user": "USERNAME",
    "es.net.http.auth.pass": "PASSWORD"
}

With that config, I get this error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
	at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
	at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:414)
	at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:395)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:239)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:302)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[https://RESOLVED_IP:PORT/path/to/elasticsearch] returned [404|Not Found:]
	at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:477)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:434)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:388)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:392)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:168)
	at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:745)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:330)
	... 32 more

I noticed that the connector is resolving the provided hostname to an IP and sending requests to that IP, which does indeed return a 404. I then tried adding "es.nodes.wan.only": "true", and it now uses the hostname as-is. However, the request still returns a 404 for some reason:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[https://ADDRESS:PORT/path/to/elasticsearch] returned [404|Not Found:]
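
My current suspicion is the path: since the connector builds its own request paths, I'm guessing a proxy-style prefix has to go into es.nodes.path.prefix rather than being appended to es.nodes. This is what I plan to try next; a sketch based on my reading of the es-hadoop configuration docs, not something I've verified:

es_read_conf = {
    "es.nodes" : "ADDRESS",                  # host only: no scheme, credentials, or path
    "es.port" : "443",
    "es.net.ssl" : "true",                   # connect over HTTPS
    "es.nodes.path.prefix" : "/path/to/elasticsearch",  # prefix prepended to every request
    "es.nodes.wan.only" : "true",            # use the declared host as-is, no discovery
    "es.net.http.auth.user" : "USERNAME",
    "es.net.http.auth.pass" : "PASSWORD",
    "es.resource" : "reviews/basic"
}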
