Indexing performance terrible after upgrading from 1.6 to 2.4

Yesterday I upgraded my cluster from 1.6 to 2.4. For some reason, indexing performance is awful.

My pipeline is {bro nms logs / syslogs / windows logs} -> redis -> logstash -> elasticsearch.

The only thing I upgraded was elasticsearch. I upgraded logstash to the 2.x branch months ago, and everything has been fine.

With ES 1.6 I could sustain an indexing rate of about 13k events per second. The cluster is 8 data nodes on dedicated hardware, each with a dedicated 1 TB disk for data, 32 GB of RAM (16 GB given to ES), and 4 processor cores, plus 3 master nodes running in VMs. In the past, ES was CPU bound, with my CPUs maxed out.

On ES 2.4, I can only get about 800 events per second, and it looks like it's disk bound now, with the IO lights lit up solid, and the CPUs only running about 20%.

I've tried playing around with the Logstash output settings, adjusting flush_size and the number of worker threads, which got me from about 300 events per second up to 800. Beyond that, I can't seem to get any more throughput. If I can't fix this, I'm going to have to blow away my cluster, reinstall 1.6, and restore my backup.
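For reference, this is roughly how I'd sanity-check raw bulk throughput straight against the cluster, outside of Logstash (just a sketch using the official Python client; the index name, document shape, and bulk sizes are made-up placeholders):

```python
import time

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def fake_events(n):
    # Stand-in documents; replace with something shaped like your real events.
    for i in range(n):
        yield {"_index": "bulk-test", "_type": "event", "_source": {"message": "event %d" % i}}

# Index the same number of docs at a few different bulk sizes and compare
# rates, to see whether the ceiling moves with bulk size or stays put.
for chunk_size in (500, 1000, 5000):
    start = time.time()
    helpers.bulk(es, fake_events(50000), chunk_size=chunk_size)
    rate = 50000 / (time.time() - start)
    print("chunk_size=%d: %.0f events/sec" % (chunk_size, rate))
```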

Anybody got any ideas?

Slow indexing after upgrading to 2.x usually comes from the change in the default of index.translog.durability from async to request, which means the translog is fsynced on every request. If you weren't using _bulk before, performance after the upgrade will be terrible; with fairly large _bulk inserts we saw only about a 7% performance hit. If you are using Logstash, though, it should already be sending bulks. You can test whether this is the issue by dynamically changing that setting on the index you are writing to. If that makes performance much better, I'd investigate your bulk sizing.
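A minimal sketch of that dynamic settings change with the official Python client (the index name is a placeholder for whatever Logstash is currently writing to; the same thing works as a PUT to that index's _settings endpoint):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Relax translog durability on the active index to see whether per-request
# fsyncs are the bottleneck. "request" is the 2.x default; "async" restores
# the old 1.x-style behaviour (at the cost of durability on a crash).
es.indices.put_settings(
    index="logstash-2016.10.05",  # placeholder: the index you're writing to
    body={"index": {"translog": {"durability": "async"}}},
)
```

Remember to set it back to request afterwards if you want to keep the safer default.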

Beyond that I don't remember anything specific. You can usually get some interesting information by running a hot_threads request and having a look, or posting the output here.
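For example, with the Python client (the same output is available from GET /_nodes/hot_threads):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Snapshot of the busiest threads on each node, returned as plain text
# that's easy to paste into the thread.
print(es.nodes.hot_threads())
```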