ES indexing rate varies horribly


(Kenny Qiao) #1

Here is one example of my indexing rates for "latest 15 minutes" .

I have 5 data nodes with ES2.1. I've no idea why indexing behaved like this periodically(from ~80,000/s to zero and then back to ~80,000/s).

Is it because of auto throttling?

Thanks.


Performance Help
(Mark Walkom) #2

Could be a bunch of things, like GC or throttling. What else does Marvel tell you?


(Kenny Qiao) #3

Thanks for your response.

I have a look at nodes status marvel told me at that time. 5 host behaved like this:

I also take a look at the gc.log and didn't find any FGC at that time, but found a lot of messages like this:

2016-01-05T22:31:09.779+0800: 10257.403: Total time for which application threads were stopped: 0.0517857 seconds, Stopping threads took: 0.0012074 seconds

Is it because of GC? But how to explain the whole indexing rate varies periodically (~ 1 minute)?

Thanks.


(Kenny Qiao) #4

Also, is it possible to disable auto-throttling in ES2.1.1?


(Kenny Qiao) #5

Any suggestions? I still hit this problem even I double ES machines from 5 to 10.
Could anyone shed some light on this?


(Tin Le) #6

Can you share your elasticsearch.yml? Was anything changed from defaults?

I am seeing something like this in my test 2.1.1 cluster, but not as bad (short periodic) as yours. I think I know what caused mine to happen, like to see your configs first to confirm.

Tin


(Kenny Qiao) #7

Thanks, Tin.

Here are some configurations in my ealsticsearch.yml(removed some common fields like host, gateway):

bootstrap.mlockall: true
node.max_local_storage_nodes: 1

index.merge.scheduler.max_thread_count: 1
index.translog.durability: async
index.translog.flush_threshold_size: 4g

Anything wrong?


(Tin Le) #8

Hmmm, are you zoomed in to 15 minutes time range? If you put the mouse on the graph and note the time period between distinct point on the graph, is it 10s?

I believe you are running into a graphing bug. Marvel polls ES at 10s intervals.

So what you are seeing is not the same problem that I see. I can reproduce your graph by zooming in to 15 minutes time range.

Tin


(Kenny Qiao) #9

Even though I zoom the window to 1 hour, I still see indexing rate varing sharply(marvel pool interval at 10 secs):

What did you find in your testing?

-Kenny


(Tin Le) #10

I get periodic dips in my indexing rate, which I suspect is due to segments merging. The index rate will dropped dramatically periodically. Here is what I mean.

First is the 15 min window to show that I see the same graph you are. It's a bug in Marvel graphing.

The first time I saw this several weeks ago, I thought something went wrong till I zoom back out. As you can see, it show that indexing has stopped for sometime.

The next graph is when I zoomed out 24hrs.

The dip around 2pm is me experimenting with configs.


(Kenny Qiao) #11

Thanks for detail info.

I think I'm experiencing some real performance problems. I still see those dips even in 4 or 24 hours.

Did you have any optimization on merging?


(Tin Le) #12

I've made a number of tweaks, but they are specific to my particular HW/SW environment. Let's hear more about your particular env. Beside the config you posted, any other changes? What are the specifics of your HW/SW?

The cluster I have now is purely for testing purposes. I have lots of physical ES nodes belonging to many clusters, mostly running ES 1.X. I am looking to see what is the best configs for my particular env, and migration planning for my clusters.


(Kenny Qiao) #13

I have 10 boxes and each with 12 cpu w/ HT, 96GB memory, 12 SATA 2T disks.

Other changes for ES is just increase bulk queue_size to 3000.

No more changes.


(Tin Le) #14

Is that 12 SATA disks per box? How are they organized? JBOD? RAID? what level of RAID?

How is your JVM heap? What's ES ES_HEAP_SIZE set to?

When you say bulk queue size, you meant logstash elasticsearch output plugin? What's the average size of your record?

What is the source of your data in? file input? beat? redis? kafka? How many logstash feeding your ES cluster?


(Kenny Qiao) #15

More details:

  • The whole data path is:
    Logstash read data from kafka, then some grok filters and then logstash sends data to Elasticsearch.
    The input and output plugin conf:
    input {
    kafka {
    zk_connect => "host1, host2, host3"
    topic_id => "prod_log"
    auto_offset_reset => "smallest"
    fetch_message_max_bytes => 20971520
    queue_size => 500
    consumer_threads => 3
    }
    }

output {
elasticsearch {
hosts => [host1, host2, host3, ....., host10]
workers => 5
flush_size => 1000
}
}

Each line in log file is stored as one json message in Kafka. The average size is ~5KB。

I have 10 logstash instances and each read 3 partitions from Kafka( I have 30 partitions for my topic).

  • Each box has 12 SATA disks, its just JBOD, no RAID.

  • JVM size of ES is 31g. The bulk queue size I mean is:
    "threadpool" : {
    "bulk" : {
    "queue_size" : "3000"
    }

One thing need to mention is even I double ES boxes from 5 to 10, the indexing rate is almost same. I'm surprised there is no ES indexing performance improvements.


(Jörg Prante) #16

Do not set this to 1. Please leave the default.


(Kenny Qiao) #17

Thanks.

I set that according to https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html;

If you are using spinning media instead of SSD, you need to add this to your elasticsearch.yml:

index.merge.scheduler.max_thread_count: 1

(Tin Le) #18

Thanks for posting details of your config. Are your data source generating enough data? Just wondering if there are periods where there is no(not enough) data to index, hence you get the periodic large drop in indexing rate.

You have more I/O capacity than my current test bed.

Mine is 64G RAM, 12 core HT Xeons, 2x1TB SATA set up as RAID1 and I have 31 physical boxes running ES 2.1.1. I have 33 separate (same HW config) boxes running logstash 2.1.1 using kafka console consumer, consuming 400 topics into 40 indices. I get a sustained 50K/s indexing rate. It goes between peak of 60K/s to low of 40K/s.

My bulk thread queue is on 100 though. I started with everything set to default and then tweaking one thing at a time to see the difference. This took me from before holiday break to now, about 3 weeks of work :slightly_smiling:

At this point, I think I am saturating my disk I/O, but there is still some room for a little bit more in terms of indexing speed. I've managed to get up to 80K/s sustained for a few hours.


(John Sotiropoulos) #19

Hello,

I have encountered the same problem with periodical drops of the indexing rate. When you "zoom out" you just change the interval Marvel uses to calculate the indexing rate.

I am using 2x (kafka & logstash) => 3x ES. All the machines are virtual with 4 cores and 8GB RAM. My goal is to achieve 2000-4000/s indexing rate.

  • Which version of kafka, logstash and Elasticsearch are you using? Because sometimes there are some errors from kafka or the zookeeper and logstash stops (especially when I use the latest version of logstash).

  • Could you share here the configuration you used to increase your indexing rate or at least which settings did you tweak and how did you come up with the right value?

  • I have noticed that if you assign 1 replica for each shard the indexing rate doubles; which means that the indexing rate in Marvel takes into account the indexing of replicas. My question here is how to achieve the same indexing rate without replicas since, as it seems, the cluster is capable of such high indexing rate?

  • Do I need more logstash instances for that or more output workers or more ES instances with more shards? (I don't see any big difference when I assign more output workers!)

Any piece of advice here would be very helpful!

Thank you in advance.


(Tin Le) #20

You need to add more consumers. Are you using logstash kafka input plugin? Have you tried tweaking the number of threads? How many kafka partitions for your topic?

If possible, increase the number of partitions and have one consumer thread per partition in your consumer. You may need to add more logstash instances.

My test bed is using ES 2.2.0, LS 2.1.3 and kafka-console-consumer via the pipe input.

I am still testing and tweaking, not done yet. I intend to write up my experiences when done.

The entire ELK stack changes so fast lately it's unstable (IMHO). There are (too) many knobs to tweak :wink:

Tin