I have an Elasticsearch URL with embedded username/password that looks something like this:
https://USERNAME:PASSWORD@ADDRESS/some/path/to/elasticsearch
Of course, I have no issues submitting queries to this address with curl (on the default port, 443). However, when I use elasticsearch-hadoop
with PySpark to fetch data from this URL, it seems to resolve the hostname to just USERNAME.
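For reference, the working curl call is roughly equivalent to this sketch using the requests library (USERNAME, PASSWORD, and ADDRESS are placeholders for my actual credentials and host):
import requests

# Same endpoint; credentials passed via HTTP basic auth instead of embedded in the URL.
resp = requests.get(
    "https://ADDRESS/some/path/to/elasticsearch",
    auth=("USERNAME", "PASSWORD"),
)
print(resp.status_code)  # expect 200 and the usual cluster-info JSON
print(resp.json()["version"]["number"])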
Steps to reproduce
Code:
import pyspark
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar pyspark-shell'
spark = SparkSession.builder.getOrCreate()
es_read_conf = {
    "es.nodes": "https://USERNAME:PASSWORD@ADDRESS/some/path/to/elasticsearch",
    "es.port": "443",
    "es.resource": "reviews/basic"
}
es_rdd = spark.sparkContext.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
Stack trace:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot resolve ip for hostname: USERNAME
at org.elasticsearch.hadoop.util.SettingsUtils.resolveHostToIpIfNecessary(SettingsUtils.java:84)
at org.elasticsearch.hadoop.util.SettingsUtils.qualifyNodes(SettingsUtils.java:46)
at org.elasticsearch.hadoop.util.SettingsUtils.declaredNodes(SettingsUtils.java:142)
at org.elasticsearch.hadoop.util.SettingsUtils.discoveredOrDeclaredNodes(SettingsUtils.java:148)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:64)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:58)
at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:101)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:327)
at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:414)
at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:395)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:239)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:302)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Version Info
OS: macOS Mojave 10.14.6
JVM:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b04, mixed mode)
Hadoop/Spark: Spark 2.4.4
ES-Hadoop: 7.6.0
ES : 6.8.1
Any help would be appreciated!
Edit
I've also tried this config:
es_read_conf = {
    "es.nodes": "https://ADDRESS:443/path/to/elasticsearch",
    "es.resource": "reviews/basic",
    "es.net.http.auth.user": "USERNAME",
    "es.net.http.auth.pass": "PASSWORD"
}
Where I get this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:414)
at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:395)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:239)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:302)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[https://RESOLVED_IP:PORT/path/to/elasticsearch] returned [404|Not Found:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:477)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:434)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:388)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:392)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:168)
at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:745)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:330)
... 32 more
I noticed that, for some reason, it tries to resolve the hostname to an IP, and requests against that IP return a 404. I then tried the "es.nodes.wan.only": "true"
option, and now it does use the hostname. However, I still get a 404 for some reason:
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[https://ADDRESS:PORT/path/to/elasticsearch] returned [404|Not Found:]
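Reading through the es-hadoop configuration docs, I suspect the path component inside es.nodes is the problem, and that it belongs in the es.nodes.path.prefix setting instead. This is the config I plan to try next (just a sketch; I haven't verified that it fixes the 404):
es_read_conf = {
    "es.nodes": "https://ADDRESS",
    "es.port": "443",
    # Path moved out of es.nodes into the dedicated prefix setting.
    "es.nodes.path.prefix": "/path/to/elasticsearch",
    # WAN-only mode: talk only to the declared node, skip node discovery.
    "es.nodes.wan.only": "true",
    "es.net.http.auth.user": "USERNAME",
    "es.net.http.auth.pass": "PASSWORD",
    "es.resource": "reviews/basic"
}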