I am having trouble testing some of the code from my new book, Agile Data Science 2.0. I am writing from PySpark to Elasticsearch and keep running into an error.
This is local Spark on one node with a local Elasticsearch instance, as these are examples for a book. It is running on an r4.xlarge EC2 instance on Ubuntu. The Parquet data is 155 MB.
The script is here: ch04/pyspark_to_elasticsearch.py
It looks like:

    # Load the parquet file
    on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

    on_time_dataframe.repartition(1).write.format("org.elasticsearch.spark.sql")\
        .option("es.resource", "agile_data_science/on_time_performance")\
        .mode("overwrite")\
        .save()
Note that I added the call to repartition to try to throttle the work. After a few minutes, I get this error: https://gist.github.com/rjurney/ec0d6b1ef050e3fbead2314255f4b6fa
The take-home message is:
[agile_data_science] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [agile_data_science] containing  requests]
What can I do to make this work? It looks like Elasticsearch is getting overloaded, but this is a single Spark partition, so I don't know how to throttle it further. Note that some records do get written — I can search them afterwards — but they don't all make it.
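For reference, the elasticsearch-hadoop connector exposes bulk-sizing and retry options that throttle the writer independently of Spark partitioning. A minimal sketch of what I have tried reading about — the specific values here are illustrative guesses, not tuned settings:

```python
# Sketch: shrink each bulk request and let the connector back off and
# retry instead of timing out when Elasticsearch falls behind.
# es.batch.size.entries / es.batch.size.bytes cap the size of one bulk
# request; es.batch.write.retry.* control retries on rejection.
on_time_dataframe.repartition(1).write.format("org.elasticsearch.spark.sql") \
    .option("es.resource", "agile_data_science/on_time_performance") \
    .option("es.batch.size.entries", "100") \
    .option("es.batch.size.bytes", "1mb") \
    .option("es.batch.write.retry.count", "10") \
    .option("es.batch.write.retry.wait", "60s") \
    .mode("overwrite") \
    .save()
```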
Would adding more shards help? I just don't know. Any suggestions would be appreciated.
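In case it helps diagnose: my understanding is that "primary shard is not active" usually means a shard is unassigned or the node is unhealthy (e.g. disk watermark), which the standard cluster APIs can confirm before I try adding shards. A hedged diagnostic sketch, assuming a default local install on port 9200:

```shell
# Overall cluster health -- status "red" means a primary shard is
# unassigned, which would match the "primary shard is not active" error.
curl 'localhost:9200/_cluster/health?pretty'

# Per-shard view of the index: shows which shards are UNASSIGNED.
curl 'localhost:9200/_cat/shards/agile_data_science?v'

# Ask Elasticsearch to explain its allocation decision (ES 5.x and later).
curl 'localhost:9200/_cluster/allocation/explain?pretty'
```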