Use of saveToEsWithMeta and performance for idempotence writing


(Vincent Gromakowski) #1

Hi all,
I am looking for a method to index lots of small documents from Spark using a specific ID to manage idempotence and deduplication (case of failure, case of at least once processing...).
The problem is that I get very bad performance with saveToEsWithMeta and my jobs are crashing after 80K docs inserted in a single spark job.
I got very interesting solutions from Costin here: https://github.com/elastic/elasticsearch-hadoop/issues/706

My first question is : does the size of the ID has something to do with writing performance ? So does the type of the ID ?
One possible solution would be to reduce the length of my string or to convert to another format ?

My cluster config is a mesos cluster running one elastic node per bare metal (32 threads 128 GB RAM 16 disks) where we reserved 24 GB for elastic.
My second question : will I get more performance putting 2 nodes on each server, reducing the heap and parallelizing the writes ?


(Costin Leau) #2
  1. IDs are not typically an issue. How different are they compared to the generic ID in ES size wise? I assume you are using some type of UUID - how long is it?

There are several blogs/docs/webinars on ES and performance, including on es.com. For example see this one:
https://www.elastic.co/blog/performance-indexing-2-0

It's not clear how many machines do you in your cluster. ES can easily have multiple nodes on the same machine if the machine is beefy enough - this means not just RAM (notice that we highly recommend leaving half of your RAM to the OS; it helps a lot with IO) but also CPU and storage. In particular with storage, having disks used individually by ES helps since sharing it with other apps will affect performance.


(Vincent Gromakowski) #3

My IDs are : 16 characters + a date time (yyyMMddHHmmss) all in one string

I have 4 ES nodes running on 4 mesos bare metal slaves (32 threads 128 GB RAM 16 disks). One ES node per mesos slave. We didn't put all the RAM in mesos offers to be sure there is still some for OS file cache...
3 disks in RAID0 are dedicated to ES data.


(Costin Leau) #4

So basically your ID is 30 chars which shouldn't be a problem (by the way, dates can be better encoded though it doesn't look like this is what you are after - also not sure what your 16 chars contain but many UUIDs end up encoding the time information in there making the date info redundant).
Some info on the matter here: http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html

It looks like you are sharing resources (in mesos) for your services - you have 3 disks in RAID0 for 4 ES nodes and mesos uses most of the RAM.
This means the 4 nodes will spread their reads and writes across the 3 disks which probably isn't ideal. I realize this might be a result of some constraint but one disk per node is probably better especially since you probably have some indices with replication in there.
Any reason why you are using RAID0? Depending on the implementation (software vs hardware) it might affect the OS performance.

A straight workaround is to minimize the number of bulk requests going to ES at once to avoid the IO lagging too much behind.


(Vincent Gromakowski) #5

Sorry I was not clear, each ES node is running on a mesos slave (1->1) and each ES node has 3 disks of the 16 available on the mesos slave.
We are using software RAID0 but I understand we should change, o you recommend JBOD or one volume per disk and let ES parallelize I/O in each volume ?


(Costin Leau) #6

If the 3 disks are per node, then in that case it should not be a problem, especially if it's RAID0.

Monitor your ES cluster and see why is ES falling behind. You should be able to see a pattern right away.


(system) #7