I have a simple Elasticsearch query whose result count I want to retrieve. On my local Mac environment, the setup below works as expected.
On an Ubuntu machine, I instead see ERROR NetworkClient:144 - Node [10.0.0.1:9200] failed
(which is odd, because the configuration points to 127.0.0.1).
Environment that does not work:
OS: Ubuntu
Python: 2.7.10
Spark: 2.3.1 and 2.3.2
Elasticsearch: 5.6.5
ES Hadoop Jar: elasticsearch-spark-20_2.11-5.6.5.jar
Environment that does work:
OS: macOS
Python: 2.7.10
Spark: 2.2.1, 2.3.1, and 2.3.2
Elasticsearch: 5.6.3
ES Hadoop Jar: elasticsearch-spark-20_2.11-5.6.3.jar
PySpark is run from the interactive shell via
path/to/pyspark \
--jars /path/to/elasticsearch-spark-20_2.11-5.6.5.jar \
--master local
The code I run is as follows:
# I lowered es.http.timeout to 10 seconds hoping to surface the error faster,
# but the job still appears to wait the full 1 minute (the documented default)
config = {
    'es.nodes': '127.0.0.1:9200',
    'es.scroll.size': '9000',
    'es.resource': 'mydata/data',
    'es.query': '{"query": {"match_all": {}}}',
    'es.http.timeout': '10s'}
dataframe = sqlContext.read.format("org.elasticsearch.spark.sql").options(**config).load()
dataframe.count()
# waits about a minute, then shows the error below, then retries with 127.0.0.1 and errors out again
# ERROR NetworkClient:144 - Node [10.0.0.1:9200] failed (Connection timed out (Connection timed out)); selected next node [127.0.0.1:9200]