Parallel bulk indexing performance

Hi,

I wrote some code with pyelasticsearch to run bulk indexing over a list of 110 million flat HTML pages plus metadata. I'm not sure what to expect with regard to performance, but here's the environment:

12 nodes: 6 of these http://www.ovh.com/us/dedicated-servers/eg_64g.xml and 6 of these http://www.ovh.com/us/dedicated-servers/kimsufi.xml

The indexing method uses workers that receive URLs from a central ZMQ server, which distributes them evenly across requests. Each worker fetches pages and, once it has 200 pages in memory, sends a bulk index request to the cluster. The workers run only on the high-powered machines, so the smaller machines are effectively dedicated ES boxes. Workers are spawned continuously (2 new workers every 10 seconds per server), but if a worker's write throughput ever drops below 10 mbps, it terminates directly after that write completes. At the moment this results in an average of about 25 workers running per large server at any given time. A rough sketch of one worker is below.
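For concreteness, here is a minimal sketch of what one worker does. It assumes a ZMQ PUSH/PULL pipeline and uses placeholder names throughout (the tcp://master:5557 address, the "pages" index, and the "page" doc type are illustrative, not the real values; page fetching is shown with requests for brevity):

import time
import zmq
import requests
from pyelasticsearch import ElasticSearch

BATCH_SIZE = 200
MIN_BYTES_PER_SEC = 10 * 1000 * 1000 / 8  # 10 mbps expressed as bytes/sec

es = ElasticSearch('http://matrix1:9200/')
ctx = zmq.Context()
work = ctx.socket(zmq.PULL)
work.connect('tcp://master:5557')  # central URL distributor (placeholder address)

docs = []
while True:
    url = work.recv_string()
    resp = requests.get(url, timeout=30)
    docs.append({'url': url, 'html': resp.text})
    if len(docs) >= BATCH_SIZE:
        start = time.time()
        es.bulk_index('pages', 'page', docs, id_field='url')
        elapsed = time.time() - start
        payload = sum(len(d['html']) for d in docs)
        docs = []
        # self-terminate when write throughput falls below the floor
        if payload / elapsed < MIN_BYTES_PER_SEC:
            break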

I have set up an index with 12 shards and turned off replication for now to maximise throughput, and I have also disabled refreshing until the bulk load is fully completed. Even so, I am averaging no higher than 3k pages per second indexed. Looking at iftop across the entire swarm, none of the machines appear anywhere near saturating the IO pipeline, so I'm not sure what the bottleneck would be.
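For reference, the replica and refresh settings can be applied through the plain REST API; a minimal sketch, assuming the index is called "pages" (placeholder name):

import json
import requests

settings = {'index': {'number_of_replicas': 0,    # no replication during the load
                      'refresh_interval': '-1'}}  # don't refresh until the load finishes
requests.put('http://matrix1:9200/pages/_settings', data=json.dumps(settings))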

I have installed the paramedic plugin on the cluster, and I notice that the load average distribution across the systems is quite variable; it's not the even split I would predict between the high- and low-powered servers. Looking now, for example, I see:

(1-6, high powered)  1: 0.68   2: 1.28   3: 1.2    4: 0.6    5: 1.03   6: 2.46
(7-12, low powered)  7: 5.41   8: 3.53   9: 5.94   10: 4.37  11: 1.07  12: 4.94

I have already set the heap size to 3 GB on all ES nodes. Here is a sample config file:

index.mapping.ignore_malformed: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 3s
discovery.zen.ping.unicast.hosts: ["matrix1","matrix2","matrix3","matrix4","matrix5","matrix6","matrix7","matrix8","matrix9","matrix10","matrix11","matrix12"]

discovery.zen.ping.multicast.enabled: false
cluster.name: meanpath
node.name: "matrix4"
path.data: /home/elasticsearch
index.cache.field.type: soft
index.cache.field.max_size: 1000000

The only other difference between the configs is the data path: the higher-powered servers use a single directory on RAID10, while on the lower-powered servers I used two separate directories on plain disks, hoping to avoid software RAID overhead, e.g.:
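(Paths below are illustrative; Elasticsearch accepts multiple comma-separated data paths, and it stripes shards across them.)

path.data: /home/elasticsearch1,/home/elasticsearch2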

Is the performance I'm seeing pretty much what would be expected? Is there anything obvious to change? Any advice appreciated.

Regards
Eric