How large are your documents? How fast can you pull from JDBC if you do not send to Elasticsearch? What is the specification of your Elasticsearch cluster? How many shards are you indexing into?
Each document/row has 20 columns; I am not sure about the size.
I am using a persistent queue with:
queue.page_capacity: 64mb
queue.max_bytes: 4gb
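For reference, the queue type itself is also set in logstash.yml (as I understand it, that is what enables the persistent queue):
queue.type: persisted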
I am using a JDBC page size of 50000. It takes 2 minutes to fetch the data from the input, and once it arrives there is only a very small filter operation, but the problem is that it does not index fast enough.
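For reference, the input block looks roughly like this; the driver, connection string, credentials and statement below are simplified placeholders, not my real values:

input {
  jdbc {
    jdbc_driver_library => "/path/to/jdbc-driver.jar"            # placeholder path to the JDBC driver jar
    jdbc_driver_class => "com.example.jdbc.Driver"                # placeholder driver class
    jdbc_connection_string => "jdbc:example://db-host:1521/mydb"  # placeholder connection string
    jdbc_user => "user"                                           # placeholder
    jdbc_password => "password"                                   # placeholder
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
    statement => "SELECT * FROM my_table"                         # simplified placeholder query
  }
}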
Hi @Christian_Dahlqvist, I tested twice and I am getting the same result: about 4K records in the text file every minute.
My understanding is that I am sending data to ES on 2 nodes, so the indexing throughput should be several times higher.
But in the file case I was writing to a single file.
Can you please guide me on which settings I need to change to increase the indexing speed?
Then it would seem either the filters or the JDBC input is the bottleneck. What filters are you using? Have you tried fewer threads and/or smaller bulk sizes? What happens if you disable all filters and just write to file?
I tried without any filters and with the file output, and I got roughly the same result, approximately 4K per minute. I would say it was a little faster, close to 4.8-4.9K.
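For reference, the test output in that run was just the file output (the path is a placeholder):

output {
  file {
    path => "/tmp/logstash-test-output.log"  # placeholder path for the test file
  }
}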
Can you please tell me if my understanding here is correct?
If there are 2 data nodes and we set the -b size to 2000, will 2000 documents be flushed to each of the nodes in one go, so that a total of 4000 documents are indexed?
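For reference, my output block looks roughly like this (host addresses and index name are placeholders); my doubt is whether each batch of 2000 is sent to both hosts or spread across them:

output {
  elasticsearch {
    hosts => ["http://node1:9200", "http://node2:9200"]  # placeholder addresses for the 2 data nodes
    index => "my_index"                                   # placeholder index name
  }
}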
Then it seems it is the JDBC query and retrieval of results that is limiting performance. Based on your description I do not see the bottleneck being Elasticsearch or the Logstash filters.
To me it looks like a problem with the JDBC input in combination with Oracle.
On further investigation I did the following:
I used MySQL as the input with a simple SELECT query, no joins and no WHERE clause, and kept the batch size at 5000; it worked perfectly.
I applied the same SELECT query with the Oracle JDBC input, no other changes, and kept the batch size at 5000, but the data is loaded in batches of only 100 to 200. This is very strange and I am not sure where to fix it.
I set the -b size on the command line.
I also tried setting the batch size in the yml file.
While it was running I checked with the monitoring API, and the batch size was showing as 5000.
But it was not loading in batches of 5000; it was doing it in batches of around 200.
How can I debug this? There is no useful information in the log file.
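One thing I plan to check next (just my assumption at this point, nothing confirmed) is the plugin's jdbc_fetch_size option, since as far as I know the Oracle driver's default row fetch size is quite small. A minimal sketch of the Oracle input with it set, with placeholder connection details:

input {
  jdbc {
    jdbc_driver_library => "/path/to/ojdbc8.jar"                     # placeholder path to the Oracle JDBC jar
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"     # placeholder driver class
    jdbc_connection_string => "jdbc:oracle:thin:@db-host:1521/ORCL"  # placeholder connection string
    jdbc_user => "user"                                              # placeholder
    jdbc_password => "password"                                      # placeholder
    jdbc_fetch_size => 5000                                          # assumption: ask the driver for 5000 rows per round trip instead of its small default
    statement => "SELECT * FROM my_table"                            # same simple query, no joins, no WHERE clause
  }
}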
Hi @Christian_Dahlqvist, I think I have found another clue as to why it is not working. I checked the log file closely, and the warning below occurs only when the number of in-flight events is greater than 10K.
I calculate it as: in-flight events = workers * batch size, so for example 4 workers * a batch size of 5000 = 20000 in-flight events.
Whenever it goes beyond 10K, Logstash seems to apply some settings of its own and, I guess, falls back to the default batch size of 125, which is why I see increments of 100-125.
This is the warning I get:
CAUTION: Recommended inflight events max exceeded! Logstash will run with up to 20000 events in memory in your current configuration. If your message sizes are large this may cause instability with the default heap size. Please consider setting a non-standard heap size, changing the batch size (currently 5000), or changing the number of pipeline workers (currently 4)
Do you have any idea how I can set the max in-flight events size, and where I can set it? I don't see any such setting anywhere in the documentation.
I have an 8-core machine; even with 4 workers and a batch size of 3000 it does not work, and I get the same warning as above.
With the default settings, it goes in batches of 50-100 every second.
With a setting of 2 workers and a batch size of 4000 (which is not more than 10K in-flight events), I do not get the WARN above, but here too the data goes in batches of 100-125 every second.
It should go in batches of 4000. Just to experiment, I also limited my query to fewer columns; the settings I have been varying are listed below.
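As I understand it, these map to the following in logstash.yml (or the equivalent -w / -b command-line flags):

pipeline.workers: 2        # equivalent to -w 2
pipeline.batch.size: 4000  # equivalent to -b 4000

and the heap suggested by the warning would go in config/jvm.options (example values, just my assumption):

-Xms4g
-Xmx4g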