I am looking for a way to index lots of small documents from Spark, using a specific ID per document to get idempotence and deduplication (in case of failure, at-least-once processing, and so on).
The problem is that I get very poor performance with saveToEsWithMeta, and my jobs crash after about 80K docs inserted in a single Spark job.
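For reference, here is roughly how I am writing. This is a minimal sketch, not my actual job: the host, index name, IDs, and documents are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object BulkIndexSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder settings; the real job points at my Mesos cluster.
    val conf = new SparkConf()
      .setAppName("es-bulk-index-sketch")
      .set("es.nodes", "es-host:9200")
    val sc = new SparkContext(conf)

    // Each element is (documentId, document). With saveToEsWithMeta the
    // key of the pair RDD becomes the Elasticsearch _id, which is what
    // makes retries idempotent: re-inserting the same ID overwrites the
    // same document instead of creating a duplicate.
    val docs = sc.parallelize(Seq(
      ("3f2a9c54-1b7e-4d2a-9c1e-8a51d0f7b2aa", Map("field" -> "value1")),
      ("7c4d11e0-52aa-4b8f-b3a7-0d9e6a2c4f01", Map("field" -> "value2"))
    ))

    docs.saveToEsWithMeta("myindex/doc")
  }
}
```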
I got very interesting solutions from Costin here: https://github.com/elastic/elasticsearch-hadoop/issues/706
My first question: does the size of the ID have an impact on write performance? And does the type of the ID matter as well?
If so, one possible solution would be to shorten my string IDs or to convert them to another format (see the sketch below).
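To illustrate what I mean by "another format": assuming the IDs are UUID strings (a hypothetical example, not necessarily my real data), the 16 raw bytes could be re-encoded as URL-safe Base64, shrinking each ID from 36 to 22 characters:

```scala
import java.nio.ByteBuffer
import java.util.{Base64, UUID}

// Hypothetical: re-encode a 36-character UUID string as 22 characters of
// URL-safe Base64 over its 16 raw bytes.
def compactId(uuidString: String): String = {
  val uuid  = UUID.fromString(uuidString)
  val bytes = ByteBuffer
    .allocate(16)
    .putLong(uuid.getMostSignificantBits)
    .putLong(uuid.getLeastSignificantBits)
    .array()
  Base64.getUrlEncoder.withoutPadding.encodeToString(bytes)
}

// compactId("3f2a9c54-1b7e-4d2a-9c1e-8a51d0f7b2aa") returns a 22-char ID.
```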
My cluster is a Mesos cluster running one Elasticsearch node per bare-metal machine (32 threads, 128 GB RAM, 16 disks), with 24 GB reserved for the Elasticsearch heap.
My second question: would I get better performance by running 2 nodes on each server, reducing the heap, and parallelizing the writes?