Hello, I'm using Spark SQL to extract data from Elasticsearch into CSV files.
Software versions used:
Elasticsearch 5.5.2
Spark SQL 2.1.1
elasticsearch-spark-20_2.11 5.5.2
Scala 2.11.8
I query Elasticsearch through an alias.
When the alias is updated (indices removed and new ones added) while the Spark job is running, the job fails.
Is it possible to keep reading from the same indices from the beginning to the end of the Spark job execution (an atomic process)?
My workaround is to look up the indices associated with the alias at the beginning of the Spark job and initialize the DataFrame with those indices, as in the sketch below.
Is there a better solution?
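For reference, this is roughly what my workaround looks like. It is only a minimal sketch: the node address (`localhost:9200`), alias name (`my_alias`), type (`doc`), and output path are placeholders. It resolves the alias to its concrete indices once at job start and points the DataFrame at those indices instead of the alias:

```scala
import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object AliasSnapshotRead {
  def main(args: Array[String]): Unit = {
    val esNodes = "localhost:9200"   // placeholder
    val alias   = "my_alias"         // placeholder
    val esType  = "doc"              // placeholder

    // Resolve the alias once at job start: GET /_alias/<alias> returns a
    // JSON object keyed by the concrete index names behind the alias.
    val body = scala.io.Source.fromURL(s"http://$esNodes/_alias/$alias").mkString
    val indices: Seq[String] = parse(body) match {
      case JObject(fields) => fields.map(_._1)
      case _               => Seq.empty
    }

    val spark = SparkSession.builder().appName("alias-snapshot-read").getOrCreate()

    // Read from the concrete indices instead of the alias, so a later alias
    // swap does not change what the running job reads.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", esNodes)
      .load(indices.mkString(",") + "/" + esType)

    df.write.option("header", "true").csv("/tmp/export") // placeholder output path
    spark.stop()
  }
}
```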
At the start of the job the connector should discover the active search shards for an alias, each of which reports the underlying index it belongs to. After this index name is pulled, we apply any alias metadata to the scroll request for that shard when the read task is started. While the scroll request should continue to operate against the index name only, the rest of the job will continue to make calls to Elasticsearch via the provided alias.
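To illustrate, that shard discovery is visible through the `_search_shards` API: querying it with the alias returns one entry per shard copy, and each entry carries the concrete index name it belongs to. A minimal sketch, usable from the Scala REPL or spark-shell (the host and alias name are placeholders):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// GET /<alias>/_search_shards lists the active search shards behind the alias;
// every shard copy in the response reports the concrete index it belongs to.
val body = scala.io.Source.fromURL("http://localhost:9200/my_alias/_search_shards").mkString

val indexNames: Set[String] = (parse(body) \ "shards") match {
  case JArray(shardGroups) =>
    shardGroups.flatMap {
      case JArray(copies) => copies.map(c => (c \ "index").values.toString)
      case _              => Nil
    }.toSet
  case _ => Set.empty
}

println(s"Indices currently behind the alias: ${indexNames.mkString(", ")}")
```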
This sort of situation is tough to manage since aliases provide quite a bit of functionality that is otherwise transparent to the client, so we try to take advantage of the given alias as much as possible. That said, if you can provide some steps for reproducing this issue, we can look into improving how the connector handles aliases.