How to tune Logstash to Elasticsearch shipping

Hi all,

I am writing because I need your help in tuning my LS to ES communication.

What I've done:
I set up a testbed to measure my Logstash performance. The input consists of 4 large log files, and my LS instance can reach a rate of 55000 events per second (eps) when the output is set to null. However, when I add the Elasticsearch output plugin, which points to 6 ES data nodes reachable via LAN (923 Mb/s, 0.15 ms latency), the rate drops to 1150 eps with a single output worker (I realized each ES output worker opens 1 socket for each configured ES node).
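
For reference, the test pipeline looks roughly like this (just a sketch: the file paths and ES host names below are placeholders, not my real ones):

input {
  file {
    path => "/var/log/test/*.log"        # the 4 large log files
    start_position => "beginning"
  }
}
output {
  # baseline run with the null output: ~55000 eps
  # null { }
  # with the elasticsearch output and a single worker: ~1150 eps
  elasticsearch {
    hosts => ["es-data1:9200", "es-data2:9200", "es-data3:9200",
              "es-data4:9200", "es-data5:9200", "es-data6:9200"]
    workers => 1
  }
}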

Is this normal?


Environment:
My LS machine has 24 cores and 8 GB of memory, so it is running with a default of

  • pipeline workers: 24
  • pipeline batch_size: 125

My ES data nodes have 32 GB of memory each and I do not see any "too many requests" errors.


I tried to perform some tuning:
Playing with the pipeline workers brought no gain, while I got better performance by increasing the ES output workers to 3 (which means my machine opens 18 sockets: 3 for each ES data node) and the pipeline batch_size to 8000 (to sustain this I had to raise the LS heap to 3 GB and the ES output plugin flush size to 16000, just to be sure).
Maybe further increasing those two factors would bring better numbers, but I don't know if this is the path to follow, since they have some impact on resource allocation. Am I missing something? I didn't try to raise the TCP buffer size on the ES data nodes' side; the default should be fine.
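
Concretely, the elasticsearch output section now looks roughly like this (a sketch, with the hosts list abbreviated), started with -b 8000 and the heap raised via LS_HEAP_SIZE=3g:

elasticsearch {
  hosts      => [ ... ]    # the same 6 data nodes as above
  workers    => 3          # 3 sockets per host, 18 in total
  flush_size => 16000      # kept at 2x the pipeline batch size
}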


Any suggestions? I am a networking guy and quite inexperienced in this area, so please help me :slight_smile:

Thanks a lot,
Daniele


The ES output will only use the flush size if it is less than the batch size, because each output only gets one batch at a time. For example, if the batch size is 100 and the flush size is 50, the ES output will do two bulk transfers of 50 events.

Try setting your ES output workers to match the pipeline workers.
-w 6 and ES output workers => 6
Try 6, 12, 18 and 24.

Please report back.


Hi guyboertje,

thanks for your reply!

I keep the flush size at double the pipeline batch size, because in production I may add a filter which increases the number of events on a pipeline thread. Does that make sense?

Regarding -w and the elasticsearch output workers, you seem to suggest assigning them the same value. I will run some tests, but could you explain why? In my experiments they control different things and should not be related to each other:

  • -w controls the number of threads running the pipeline filter/output stages.
  • elasticsearch output workers controls the number of TCP sockets the output plugin opens for each defined ES host. Since I have 6 ES data nodes, every unit increase results in 6 more open sockets.

A first batch of tests follows; as soon as possible I will publish the ones you asked for, with 6, 12, 18 and 24 elasticsearch output workers.

For all tests the following is kept fixed

-b 8000
elasticsearch output hosts: 6

Test #1

-w 1
ES output workers => 6

{:timestamp=>"2016-06-01T10:21:04.369000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>1, :batch_size=>8000, :batch_delay=>5, :max_inflight=>8000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 4800

Note: the open sockets can be counted with the following command
# netstat -anp | grep 9200 | grep java | grep ESTABLISHED | wc -l
36
The count is the number of ES hosts defined in the elasticsearch output plugin multiplied by the number of ES output workers.
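
If useful, the same sockets can also be broken down per ES host with a small variation of that command; the count for each host should match the ES output workers value:

# netstat -anp | grep 9200 | grep java | grep ESTABLISHED | awk '{print $5}' | sort | uniq -c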

Test #2

-w 2
ES output workers => 6

{:timestamp=>"2016-06-01T10:38:11.138000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>2, :batch_size=>8000, :batch_delay=>5, :max_inflight=>16000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 6650

Test #3

-w 3
ES output workers => 6

{:timestamp=>"2016-06-01T10:51:11.809000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>3, :batch_size=>8000, :batch_delay=>5, :max_inflight=>24000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 9650

Test #4

-w 4
ES output workers => 6

{:timestamp=>"2016-06-01T11:15:26.855000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>4, :batch_size=>8000, :batch_delay=>5, :max_inflight=>32000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 12000

Test #5

-w 5
ES output workers => 6

{:timestamp=>"2016-06-01T11:38:18.161000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>5, :batch_size=>8000, :batch_delay=>5, :max_inflight=>40000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 14000

Test #6

-w 6
ES output workers => 6

{:timestamp=>"2016-06-01T12:06:57.072000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>6, :batch_size=>8000, :batch_delay=>5, :max_inflight=>48000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 15300

Test #7

-w 7
ES output workers => 6

{:timestamp=>"2016-06-01T12:36:01.113000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>7, :batch_size=>8000, :batch_delay=>5, :max_inflight=>56000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 16100

Test #8

-w 8
ES output workers => 6

{:timestamp=>"2016-06-01T12:58:53.128000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>8, :batch_size=>8000, :batch_delay=>5, :max_inflight=>64000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 16800

Hi guyboertje and all the people out there,

I performed the tests as you asked, setting my ES output workers to match the pipeline workers (could you elaborate on this, please?). I had to reduce my batch size from 4000 to 3000 to keep the heap under control even in the -w 24 case.
The max rate I can get is 20000 eps, even though without the ES output plugin I can reach 55000 eps (with default settings of -w 24 -b 125). Where is the bottleneck in my case? If it were the ES data nodes, I should see the "too many requests" errors in the logs, shouldn't I?
Thanks a lot for any help :slight_smile:

The results of my testing follow:

Fixed params

-b 3000
elasticsearch output hosts: 6

Test #1

-w 6
ES output workers => 6

{:timestamp=>"2016-06-06T17:11:11.300000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>6, :batch_size=>3000, :batch_delay=>5, :max_inflight=>18000, :level=>:info}
Logstash open sockets: 36
Logstash events per second: 13300

Test #2

-w 12
ES output workers => 12

{:timestamp=>"2016-06-06T16:52:21.897000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>12, :batch_size=>3000, :batch_delay=>5, :max_inflight=>36000, :level=>:info}
Logstash open sockets: 72
Logstash events per second: 18000

Test #3

-w 18
ES output workers => 18

{:timestamp=>"2016-06-06T16:18:10.515000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>18, :batch_size=>3000, :batch_delay=>5, :max_inflight=>54000, :level=>:info}
Logstash open sockets: 108
Logstash events per second: 20000

Test #4

-w 24
ES output workers => 24

{:timestamp=>"2016-06-06T16:00:40.900000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>24, :batch_size=>3000, :batch_delay=>5, :max_inflight=>72000, :level=>:info}
Logstash open sockets: 144
Logstash events per second: 20000

@danygor:

For the current versions of LS, the ES output's workers setting is the number of output instances in use.
A single ES output is not yet thread-safe, so it cannot be shared amongst the worker threads.
Currently we create a "pool" of outputs accessed via a blocking queue, so if ES workers = 1 then there is a pool of one output and only one pipeline worker can be communicating with ES at any given time; the other threads must wait.
This is why you don't see much improvement in throughput when increasing the batch size or pipeline workers without a corresponding increase in the ES worker count.

Thanks for completing the tests. We see a plateau of 20k eps at about 16 to 18 workers and outputs.

As far as I know, in ES, the bulk request HTTP call only returns when a quorum of active shards has indexed the documents.

So 20k eps at 3000 docs per request translates to 6.66667 ES requests per second or 150 ms round trip time.
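
Spelling out that arithmetic:

20000 eps / 3000 docs per bulk request ≈ 6.67 bulk requests per second
1 s / 6.67 requests ≈ 150 ms round trip per bulk request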

Now it's time to tune ES.

See:
bulk-consistency
performance-considerations
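
As a first check on the ES side (a sketch; the host name is a placeholder and I'm assuming the data nodes answer HTTP on 9200), the cat thread pool API will show whether bulk requests are queueing or being rejected:

# run against any of the data nodes and watch the bulk.active, bulk.queue and bulk.rejected columns
curl -s 'http://es-data1:9200/_cat/thread_pool?v'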