I try lo load 500 thousands MySQL rows into Elasticsearch. It takes 10h to do that.
16GB memory, i7 processor; enough resources to fly. All components on one server (no LAN network between).
It takes huuuuge amount of time to do select. Then it starts to load but then stops and waits again for something...
What this software is doing? How to improve performance?
I have done quite a bit of development on the jdbc input.
You turned on jdbc_paging_enabled but the default for the page size is config :jdbc_page_size, :validate => :number, :default => 100000
The jdbc_fetch_size you set is 4000 and is set in the driver, meaning it takes 25 trips to the DB before a page of 100000 is built and only once the page is built in LS memory can the events start being added to the work queue. While this is happening the workers are idle.
How many cores to you have? this affects the number of workers LS sets as default. The pipeline batch size is 125 so for 8 workers the in-flight batches is 1000. While the workers are off filtering and outputting these 1000 events, jdbc input has to wait for the workers to collect more events from the queue (effectively a synchronous queue structure).
Outputting to Elasticsearch is also synchronous, meaning a poorly tuned ES cluster can limit LS throughput.
My suggestion is that, to minimise the idle waiting, you need to make sure that jdbc_input is off fetching records while the workers are filtering and outputting i.e. each part is doing something and not waiting on the other. In practice, setting jdbc_fetch_size and jdbc_page_size to workers X 125 might hammer the DB too much - so some multiple of that number could be as good. You may get more throughput if you upped the batch_size to 200 and changed the other numbers accordingly.
?
But this will result in 200 DB queries(!) to complete 100 000 rows (this 100 000 is set for good or you mean since there was no jdbc_page_size then default was 100 000?).
"Pipeline batch" is the no of rows/docs being treated? Logstash is converting data into ES format before it sends it to ES? On the screen I see raw DB rows but since it takes so long time I think logstash is doing something with it...
I think this is logstash problem. In Java monitor I see ES is sleeping.
How to threat xeon threads? Still we talk about no of cores not threads?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.