Reduced bulk performance in 2.3.5 versus 1.3?

ChrisO · August 18, 2016, 5:03pm

Hi all,
Like many others, we're seeing a large difference in bulk indexing performance between our old 1.3 ES and our new 2.3.5 ES cluster. I've followed a bunch of threads, but many don't report back with what their solution was.

We have 10x data nodes, and 3x master nodes.
Data nodes have 24GB ram, 12 allotted to JVM
There's about 1TB of data, documents are around 10-50kb each, depending on type.

We use not_analyzed for many fields, but not all.
We use custom routing on our documents, but one bulk request can contain documents with routing for many shards
We use one replica. We have not tried setting replicas to 0 and then raising it again after the bulk process is done, because that makes the cluster go yellow, which alerts our monitors. We'd like to avoid this if possible.
We do set translog.durability = async and are aware of consequences.
We use SSDs and set the tuning parameters in line with the ES tuning blog.
We run marvel on a separate ES instance.
We have removed all search traffic from this, and indexing does not improve.
This is one large index, and not time bucketed data.
I/O is super low (<10MB/sec)

We have tried varying bulk sizes from 500 -> 3000 documents.
We have tried varying our number of shards between 5 and 200
We've tried to vary the size of the bulk threads.
We've updated refresh_interval to -1 and 15m with no real change
GCs are not thrashing, memory looks good
Other various things from the perf blog were tried (merge throttling = none, refresh = -1, index translog flush = 1GB, etc. )
We do see that our bulk threadpools remain full, and some queueing happens. Seems like one bulk request spawns multiple threads internally. We only have 20 threads submitting bulk requests, and this fills our active threads across all nodes (we have tried reducing this as well).
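
The thread pool observation above can be checked per node with the `_cat/thread_pool` API. The following is a minimal sketch, assuming the ES 2.x column names (`bulk.active`, `bulk.queue`, `bulk.rejected`); `ES_HOST` and `ES_PORT` are placeholders, not values from this thread.

```shell
#!/bin/bash
# Sketch: inspect bulk thread pool pressure per node (ES 2.x column names).
# ES_HOST and ES_PORT are placeholders, not values from this thread.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# Build the _cat/thread_pool URL, restricted to the bulk pool columns.
bulk_pool_url() {
  printf 'http://%s:%s/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected' "$1" "$2"
}

# Fetch one snapshot. A bulk.rejected count that keeps growing means
# the queue is full and bulk requests are being rejected.
show_bulk_pool() {
  curl -s "$(bulk_pool_url "$ES_HOST" "$ES_PORT")"
}
```

Calling `show_bulk_pool` every few seconds during a bulk load shows whether the active threads and queue are saturated on every node or only on a few of them.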

With 1.3, we saw 12k documents indexed per sec
With 2.3, we're around 7k/sec
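
To put the two figures in proportion, here is a small worked calculation using only numbers stated in this post (12k/sec, 7k/sec, 10 data nodes). The per-node figure is a plain average and ignores the replica, which means each document is actually indexed twice across the cluster.

```shell
#!/bin/bash
# Percentage drop between two throughput figures (docs/sec).
pct_drop() {
  LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f", (old - new) / old * 100 }'
}

# Average rate per data node for a cluster-wide rate (primaries only).
per_node() {
  LC_ALL=C awk -v rate="$1" -v nodes="$2" 'BEGIN { printf "%d", rate / nodes }'
}

echo "drop: $(pct_drop 12000 7000)%"
echo "per node at 7k: $(per_node 7000 10) docs/sec"
```

That is a drop of roughly 42%, or about 700 documents per second per data node on 2.3.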

At this point, I'm drawing a blank as to what to try next, other than throwing more hardware at it.

Any ideas what I might be forgetting? What other info would be helpful to diagnose?

Thanks!
Chris

eperry · August 19, 2016, 1:17am

Hi Chris, you may want to read through my ongoing thread. I have a very similar issue and an almost identical use case.

I have no answers yet. I also came across this other page:

Logstash Indexer to Elastcsearch Tunnings ( must go faster!)

Here is the link

Under Integrations
https://www.elastic.co/guide/en/elasticsearch/plugins/2.3/integrations.html

Here is the product
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0

jprante · August 19, 2016, 7:02am

The reason is the new translog durability. This means ES performs many more I/O operations (especially fsync calls to the file system), and the operating system must handle these at the block device layer. Maybe you run ES on a virtual machine and the guest/host I/O channel is not configured well? Check file system setup options, network parameters, and I/O elevator settings for maximum throughput.
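
As a starting point for the I/O elevator check, here is a minimal sketch that reads the active scheduler for each block device from sysfs on a Linux host. The helper names are made up for illustration. The kernel marks the active scheduler with square brackets.

```shell
#!/bin/bash
# Extract the active scheduler (the bracketed entry) from a sysfs scheduler line.
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Print the active scheduler for every block device that exposes one.
list_schedulers() {
  for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev="${f#/sys/block/}"
    dev="${dev%%/*}"
    echo "$dev: $(active_scheduler "$(cat "$f")")"
  done
}
```

Run `list_schedulers` on each data node. Writing a scheduler name into the same sysfs file as root changes it until the next reboot. `noop` or `deadline` are commonly suggested for SSDs and virtualized guests, but treat that as something to benchmark rather than a guaranteed fix.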

jprante · August 19, 2016, 7:03am

What server hardware and operating system do you use?

ChrisO · August 19, 2016, 8:31pm

Thanks, will check VM I/O settings.

We are indeed using VMs, and running CentOS 6.

eperry · August 22, 2016, 12:41am

Chris, I am also just figuring out the new Logstash 2.3 pipeline, and it seems that the old method of allocating workers and queues is not great for tuning any more.

In my config, I thought 2 partitions in Kafka were enough for each of the Logstash agents I had. But I never got Logstash to pull or index faster than 1000 messages/s, and Logstash never pulled data faster than it was written to Kafka. When I added about 3 times more partitions, Logstash was suddenly pulling more data than I was writing. (The Kafka input auto-scales to cover the number of partitions you have for your topic.)

Also, the number of workers and the flush size don't seem to work like they did. You also have to look at the -w and -b options for Logstash:

https://www.elastic.co/guide/en/logstash/current/pipeline.html#_execution_model

When I started to play with these options, indexing throughput started to increase, though I have not figured out a formula yet.
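
For reference, a minimal sketch of how these two flags fit together. `-w` sets the number of pipeline workers and `-b` the batch size per worker, so up to workers × batch events can be in flight at once. The config file name and the numbers below are placeholders to tune, not recommendations.

```shell
#!/bin/bash
# Build a Logstash 2.x command line with explicit pipeline settings.
# The config file name and the numbers passed in are placeholders.
logstash_cmd() {
  local workers="$1" batch="$2" conf="$3"
  printf 'bin/logstash -f %s -w %s -b %s' "$conf" "$workers" "$batch"
}

# Upper bound on events in flight: workers * batch size.
inflight_events() {
  echo $(( $1 * $2 ))
}

echo "$(logstash_cmd 8 500 indexer.conf)  # up to $(inflight_events 8 500) events in flight"
```

Larger in-flight counts need more heap, so it is safer to raise one value at a time and watch both throughput and memory.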

eperry · August 22, 2016, 1:02am

Also, I set some of the options that I read about in some of the tuning guides. I would read up on each of the options before using them. I am not sure how important your data is to you, but I am a little more cowboy with mine (right now).

#!/bin/bash
# DEBUG: uncomment the next line to trace commands
#set -x
# Placeholder: replace with the hostname of one of your ES nodes
HOSTNAME="A Server"
PORT="9200"
# Apply transient settings (they do not survive a full cluster restart)
curl -XPUT "$HOSTNAME:$PORT/_cluster/settings" -d '
{
    "transient" : {
        "indices.store.throttle.type" : "none",
        "index.translog.flush_threshold_size" : "2048mb",
        "index.translog.durability" : "async",
        "index.merge.scheduler.max_thread_count" : 20,
        "index.refresh_interval" : "60s"
    }
}'
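
Settings prefixed with `index.` are index-level settings, which are usually applied per index through the index update settings API rather than through cluster settings. Here is a minimal sketch of that variant for two of the settings above. `ES_HOST`, `ES_PORT`, and the index name `myindex` are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch: apply index-level settings through the index settings API.
# ES_HOST, ES_PORT, and the index name passed in are hypothetical placeholders.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"

# JSON body with two of the index-level settings discussed in this thread.
index_settings_body() {
  cat <<'EOF'
{
  "index": {
    "refresh_interval": "60s",
    "translog.durability": "async"
  }
}
EOF
}

# PUT the body to /<index>/_settings.
apply_index_settings() {
  curl -XPUT "$ES_HOST:$ES_PORT/$1/_settings" -d "$(index_settings_body)"
}
```

Usage would be `apply_index_settings myindex`, followed by `curl "$ES_HOST:$ES_PORT/myindex/_settings?pretty"` to confirm that the values took effect.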
