Catching exceptions from saveToEs (elasticsearch-spark)


#1

Hello,

I am writing an RDD to Elasticsearch using the saveToEs method from elasticsearch-spark. The RDD might contain documents that Elasticsearch rejects with an org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest exception. I would like to catch such exceptions and simply skip the malformed documents, so that the job is not interrupted. How can I do this?


(James Baiera) #2

Right now there's no great way to do this in es-hadoop. I do recommend using the functionality provided by Spark's RDDs to transform or filter out any invalid documents before executing the final saveToEs. It's unlikely that we would provide options to filter out data when those options are already present in these frameworks.
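
A minimal sketch of that pre-filtering idea, in plain Python so it runs without a cluster. The field name `timestamp` and the rule "must parse as an ISO 8601 date" are assumptions made up for illustration, not anything es-hadoop requires. The predicate is the part that would be passed to the RDD's filter step before the final write.

```python
from datetime import datetime

def is_indexable(doc, date_field="timestamp"):
    """Return True if the document passes a hypothetical pre-check.

    Assumed rule (illustrative only): `date_field` must be a string
    that parses as an ISO 8601 date or datetime.
    """
    value = doc.get(date_field)
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True

# Stand-in for the RDD contents; with Spark this would be
# something like rdd.filter(is_indexable) before the write.
docs = [
    {"id": 1, "timestamp": "2016-03-01T12:00:00"},
    {"id": 2, "timestamp": "2016-13-45"},  # month 13 does not exist
    {"id": 3},                             # date field missing
]
clean = [d for d in docs if is_indexable(d)]
```

The check only covers what the predicate knows about, so anything Elasticsearch rejects for another reason would still fail the write.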


#3

Well, if there's a failure on the Elasticsearch side, I'd like to be able to fail gracefully - and not to have my whole job fail. In my specific case there is no simple way to do the checks beforehand, so handling the exception would be easier. Do you see any solutions to this? Maybe a saveToEs parameter to "ignore" the exceptions and log them somewhere?


(Ravi L) #4

I have also run into this. The JSON itself was fine; the problem in my case was an invalid date, and I had to look at the Elasticsearch logs to find that out. I need a way to run the job and store the failed documents in a different place so they can be re-run later.
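
One way to approximate "store the errors somewhere else" on the client side is to split the documents before writing them. This is a rough sketch only: the field name `created`, the `YYYY-MM-DD` format, and both helper functions are hypothetical, and it catches only problems the check knows about, not every rejection Elasticsearch might produce.

```python
from datetime import datetime

def validation_error(doc, date_field="created"):
    """Return a reason string if the document fails the assumed check, else None."""
    value = doc.get(date_field)
    if value is None:
        return f"missing field '{date_field}'"
    try:
        # Assumed format for this example: YYYY-MM-DD
        datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError) as exc:
        return f"invalid date in '{date_field}': {exc}"
    return None

def split_documents(docs):
    """Separate documents into those to index and those to set aside."""
    accepted, rejected = [], []
    for doc in docs:
        error = validation_error(doc)
        if error is None:
            accepted.append(doc)
        else:
            # Keep the original document next to the reason so it can be re-run later.
            rejected.append({"doc": doc, "error": error})
    return accepted, rejected

docs = [
    {"id": "a", "created": "2016-02-29"},
    {"id": "b", "created": "2016-02-30"},  # no such day
    {"id": "c", "created": 20160301},      # wrong type
]
accepted, rejected = split_documents(docs)
```

In a Spark job the same check could be applied per record, with the accepted set going to Elasticsearch and the rejected set written to a separate location for a later re-run.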


#5

I'd also be interested in this feature. As it stands, a rejected document fails the entire job, which is quite limiting.

