ES output plugin is slow at indexing even with 2 data nodes and 1 master node

Hi All

I have a basic configuration where my input is JDBC and my output is Elasticsearch (ES).
I have 1 master node and 2 data nodes.

Case 1: I am indexing data using the ES output plugin, giving the host names of both data nodes:
hosts => ["host1","host2"]

The index is created fine, but the speed is very slow:
I am able to ingest only about 200 records per second.
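For reference, a minimal sketch of the output section as described (the index name is an assumption, not from my actual config):

```
output {
  elasticsearch {
    hosts => ["host1","host2"]
    index => "my_index"   # hypothetical index name
  }
}
```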

Case 2: On top of Case 1, I added Elasticsearch settings like:

thread_pool.index.size: 3
thread_pool.index.queue_size: 1000

thread_pool.bulk.size: 3
thread_pool.write.queue_size: 3000

Even then the speed is the same, around 200 records per second.

Can anyone tell me the most efficient way of doing bulk indexing with Logstash (LS)?
I assume the LS ES output plugin does bulk indexing.

I have also applied settings like:
indices.memory.index_buffer_size: 30%

To me this looks like a problem with the ES output plugin.

A few more updates:

I am using -w 48 (pipeline workers)
and a batch size of 1250.
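The same values can also be set in logstash.yml instead of on the command line; a sketch, assuming the flags map to these keys as they do in Logstash 6.x:

```
# logstash.yml equivalents of the command-line flags
pipeline.workers: 48       # -w 48
pipeline.batch.size: 1250  # batch size
```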

How can I increase my indexing speed to, say, 40K records per minute with LS?

How large are your documents? How fast can you pull from JDBC if you do not send to Elasticsearch? What is the specification of your Elasticsearch cluster? How many shards are you indexing into?

Each document/row has 20 columns; I am not sure about the size.
I am using a persistent queue with:

queue.page_capacity: 64mb
queue.max_bytes: 4gb

I am using a JDBC page size of 50000. It takes 2 minutes to fetch the data from the input; once it arrives, a very small filter operation runs, but the problem is that it does not index fast enough.
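A sketch of the JDBC input as described, with the page size noted above (the connection string, user, and statement are placeholders, not my real values):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@//dbhost:1521/service"  # placeholder
    jdbc_user => "user"                                                  # placeholder
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
    statement => "SELECT name, description FROM volt_meter"              # simplified
  }
}
```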

Master node: 4 cores, 15 GB RAM
Data node 1: 8 cores, 30 GB RAM
Data node 2: 8 cores, 30 GB RAM

All machines have plenty of disk space, and each server has a good SSD.

I have to index 10 million records/documents.

I am using ES and LS version 6.8.

I am using 10 shards and have set index.refresh_interval: 30s.

Test what the throughput is with another output, e.g. file, so you can verify that Elasticsearch is the bottleneck.
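For such a test, the ES output can be swapped for a file output; a minimal sketch (the path is an assumption):

```
output {
  file {
    path => "/tmp/ls_throughput_test.log"  # assumed path
    codec => "json_lines"
  }
}
```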

Hi @Christian_Dahlqvist, I tested twice and got the same result: 4K records in the text file every minute.
My understanding was that since I am sending data to ES on 2 nodes, the indexing throughput should be a multiple of that,
whereas with the file output I was writing to a single file.

Can you please guide me on which settings I need to change to increase indexing speed?

Indexing throughput will only increase when you add a second node if you have multiple shards and no replicas configured.

What is the indexing throughput when you instead write to file? How long does it take to process 4K documents from start to finish?

Yes. Since I am using 2 nodes, I created a dummy index
with 10 shards and 0 replicas using a PUT curl command,

and set the refresh interval to 0.
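For reference, a sketch of such a PUT call (host, port, and index name are assumptions; note that disabling refresh is done with "-1", not 0):

```
curl -X PUT "http://host1:9200/dummy" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}'
```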

With file output, it takes 1 minute.
With ES output, it also takes approximately 1 minute.

In both cases it outputs 4K documents.

Then it would seem either the filters or the JDBC input is the bottleneck. What filters are you using? Have you tried fewer threads and/or smaller bulk sizes? What happens if you disable all filters and just write to file?

I tried without filters and with file output, and I got roughly the same result, about 4K per minute; I would say slightly faster, close to 4.8-4.9K.

Can you please tell me if my understanding is correct:
if there are 2 data nodes and we set -b to 2000, will 2000 documents be flushed to each node in one go, so that a total of 4000 documents is indexed?

Then it seems it is the JDBC query and retrieval of results that is limiting performance. Based on your description I do not see the bottleneck being Elasticsearch or Logstash filters.

Okay, I will check on that.
Can you also confirm whether my concept above is correct,
i.e. that the more nodes there are, the better the ingestion rate?

You typically scale out indexing throughput by adding nodes, but that naturally assumes there is no bottleneck elsewhere.

I also experience performance issues with the JDBC input plugin. The database I need to connect to is MSSQL. I also tried changing the fetch size etc., with no luck. Have you tried using column tracking on, let's say, 'id'? Maybe that helps in your case. https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_predefined_parameters
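A sketch of the column-tracking approach from that link, assuming an id column exists in the table:

```
input {
  jdbc {
    # ... connection settings as before ...
    use_column_value => true
    tracking_column => "id"
    statement => "SELECT * FROM volt_meter WHERE id > :sql_last_value"
  }
}
```

Logstash persists the last seen value of the tracking column and substitutes it for :sql_last_value on each run, so repeated queries only fetch new rows.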

To me it looks like a problem in the JDBC input in combination with Oracle.

On further investigation I did the following:
I used MySQL as the input with a simple select query (no joins, no where clause) and a batch size of 5000; it worked perfectly.

I applied the same select query via Oracle JDBC, with no changes and a batch size of 5000, but the data is loaded in batches of 100 to 200. This is very strange, and I am not sure how to fix it.

I passed -b on the command line,
I tried setting the batch size in the yml file,
and while running I checked with the monitoring API, which showed a batch size of 5000.

But it was not loading in batches of 5000; it was doing 200.

How can I debug this? There is no information in the log file.

Did your original query against the Oracle DB have the same structure, without joins?

Yes, the same structure without joins. I just replicated the data from Oracle to MySQL.

If you are not comparing exactly the same query on exactly the same dataset, I do not see how you can draw any conclusions from it.

Hi @Christian_Dahlqvist, I think I found another clue as to why it is not working. I checked the log file closely; this happens only when the in-flight events exceed 10K.

I calculate in-flight events = workers × batch size.
Whenever it goes beyond 10K, Logstash seems to apply some settings of its own, falling back to the default batch size of 125 I guess, which is why I see increments of 100-125.

This is the warning I get:

CAUTION: Recommended inflight events max exceeded! Logstash will run with up to 20000 events in memory in your current configuration. If your message sizes are large this may cause instability with the default heap size. Please consider setting a non-standard heap size, changing the batch size (currently 5000), or changing the number of pipeline workers (currently 4)
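The numbers in that warning line up with the formula above; a quick check, assuming in-flight events = workers × batch size:

```python
def inflight_events(workers: int, batch_size: int) -> int:
    """In-flight events as estimated in the Logstash warning."""
    return workers * batch_size

# The warning's configuration: 4 workers, batch size 5000
print(inflight_events(4, 5000))    # 20000, matching "up to 20000 events in memory"

# The earlier configuration: -w 48, batch size 1250
print(inflight_events(48, 1250))   # 60000, far above the 10K recommendation
```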

Do you have any idea how and where I can set the max in-flight events? I do not see such a setting anywhere in the documentation.

I have an 8-core machine; even setting 4 workers and a batch size of 3000 does not work, and I get a similar warning to the above.

What level of performance do you get with the default options?

With default settings, it goes in batches of 50-100 every second.

With 2 workers and a batch size of 4000 (which is not more than 10K in-flight events), I do not get the WARN above, but here too the data goes in batches of 100-125 every second.

It should go in batches of 4000. Just to experiment, I restricted my query to a few columns:

select name, description from volt_meter