How does Logstash work?

I am trying to load 500 thousand MySQL rows into Elasticsearch, and it takes 10 hours. The machine has 16 GB of memory and an i7 processor; enough resources to fly. All components run on the same server (no LAN network in between).
The SELECT takes a huge amount of time. Then loading starts, but then it stops and waits for something again...
What is this software doing? How can I improve the performance?

input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.47-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/kibanasalesdata?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => ""
    jdbc_fetch_size => 4000
    jdbc_paging_enabled => true
    statement => "SELECT some_columns... FROM table"
  }
}
filter{
	date {
		match => ["Order_Date", "yyyy-MM-dd"]
		target => "Order_Date"
	}
	mutate {
		rename => {
			"gps_long" => "[location][lon]"
			"gps_lat" => "[location][lat]"
		}
	}
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "someindex"
  }
  stdout {}
}

I would really like an answer from someone who knows this product well (not just someone willing to help without knowledge of the topic).

Regards

I have done quite a bit of development on the jdbc input.

You turned on jdbc_paging_enabled, but the default page size is 100,000 (config :jdbc_page_size, :validate => :number, :default => 100000).
The jdbc_fetch_size you set is 4000 and is applied in the driver, meaning it takes 25 round trips to the DB before a page of 100,000 rows is built, and only once the page is built in LS memory can the events start being added to the work queue. While this is happening the workers are idle.
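
In config terms (a sketch of the input above with the defaults made explicit; the 100000 is the documented jdbc_page_size default, not something you set):

input {
  jdbc {
    jdbc_paging_enabled => true
    jdbc_page_size => 100000  # default when not set explicitly
    jdbc_fetch_size => 4000   # driver-level fetch: 100000 / 4000 = 25 round trips per page
  }
}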

How many cores do you have? This affects the number of workers LS sets by default. The pipeline batch size is 125, so for 8 workers the number of in-flight events is 1000. While the workers are off filtering and outputting these 1000 events, the jdbc input has to wait for the workers to collect more events from the queue (effectively a synchronous queue structure).
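
As a worked example with the defaults (a sketch; the worker count follows your core count unless you set it explicitly):

# logstash.yml
pipeline.workers: 8      # defaults to the number of CPU cores
pipeline.batch.size: 125 # default
# maximum in-flight events = 8 workers x 125 = 1000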

Outputting to Elasticsearch is also synchronous, meaning a poorly tuned ES cluster can limit LS throughput.

My suggestion is that, to minimise the idle waiting, you need to make sure the jdbc input is off fetching records while the workers are filtering and outputting, i.e. each part is doing something and not waiting on the other. In practice, setting jdbc_fetch_size and jdbc_page_size to workers × 125 might hammer the DB too much, so some multiple of that number could be just as good. You may get more throughput if you upped the batch size to 200 and changed the other numbers accordingly.
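
For example, with 8 workers and a batch size of 200 (a sketch only; the exact multiple that suits your DB needs experimentation):

# logstash.yml
pipeline.workers: 8
pipeline.batch.size: 200

# jdbc input settings, matched to workers x batch size = 8 x 200 = 1600
jdbc_paging_enabled => true
jdbc_page_size => 1600
jdbc_fetch_size => 1600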

Hope this helps.


Thank you very much for your time!
I have a CPU with 4 cores. So the best settings would be:

jdbc_fetch_size => 500
jdbc_page_size => 500
pipeline.batch.size: 500
pipeline.workers: 4

?
But this will result in 200 DB queries(!) to complete 100,000 rows (is this 100,000 fixed, or do you mean that since I did not set jdbc_page_size, the default of 100,000 was used?).

"Pipeline batch" is the no of rows/docs being treated? Logstash is converting data into ES format before it sends it to ES? On the screen I see raw DB rows but since it takes so long time I think logstash is doing something with it...
I think this is logstash problem. In Java monitor I see ES is sleeping.

How should Xeon threads be treated? Are we still talking about the number of cores, not threads?

Regards

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.