Logstash pipline optimization

abhisheksimion · June 9, 2020, 7:32am

I have multiple configurations defined in the pipelines.yml. All the configurations are responsible for pushing data from a jDBC source to Elasticsearch. I want to know what is the optimal settings required to do so. As sometimes one of the configuration took more than 12 hours.

modified jvm.options

-Xms2g
-Xmx2g

logstash.yml

http.host: x.x.x.x
http.port: 15050
queue.type: persisted
queue.max_bytes: 8gb
queue.checkpoint.writes: 1
path.data: /home/app/logstash-6.2.4/data
path.settings: /home/app/logstash-6.2.4/config
config.reload.automatic: true
config.reload.interval: 5s

pipelines.yml

- pipeline.id: agent
  path.config: "../config/app_config/agent.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: user_dim
  path.config: "../config/app_config/user_dim.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: org
  path.config: "../config/app_config/org.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: task_dim
  path.config: "../config/app_config/task_dim.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: task_run_agg_incr
  path.config: "../config/app_config/task_run_agg_incr.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 500
- pipeline.id: unique_endpoint
  path.config: "../config/app_config/unique_endpoint.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: user_log
  path.config: "../config/app_config/user_log.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000

The following pipelines take 5-10 minutes to fully push data to Elasticsearch
agent, user_dim, org, unique_endpoint

The problem is with the user_log conf, which takes more than 12 hours to complete, any pointers on improving the data load time?

Here is the configuration and SQL query will return 29230868 records

Configuration

input {
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log.sql"
		}
}

output {
	elasticsearch {
	hosts => "<host:port>"
	index => "user_log_index_%{+YYYY_MM_dd}"
	document_id => "%{user_log_key}"
	document_type => "org_user_log"
	}
}

Box info (2 CPU cores, 8GB ram, 70GB HDD)

Rahul_Kumar4 · June 10, 2020, 10:57am

One reason why this could be happening is because you are setting the document_id, so each record insert into the Elasticsearch does a full scan of your index user_log_index_%{+YYYY_MM_dd} to check if there are already existing entries with that id, so as your index size grows, the search before index takes longer.

One way to avoid this problem can be to partition your index into smaller partitions. Right now you are likely indexing everything into one index (assuming YYYY_MM_dd does not change during the course of those 12 hours of indexing time). You could instead use a different indexing strategy, to have multiple smaller indexes so that your index scan is faster. While doing so keep in mind:

Your access patterns (how are you going to query your data).
You do not end up creating a large number of smaller indexes.

The other way could be to remove that document_id setting in the Output if you don't care about having duplicate entries in your Elasticsearch.

abhisheksimion · June 10, 2020, 12:44pm

To answer your question, I am using an alias to query this data. I want to avoid duplicates that is why I have to use document_id

Little info on the data part. This data is of audit log for one of our products, that is why it will be huge. I am thinking to partition the retrieval process by splitting into multiple queries, as given below.

The production system will have a dedicated system for Logstash with 32GB ram, 8 Core CPU and a minimum of 300GB HDD, but is this kind of load still possible with 2 CPU cores, 8GB ram, 70GB HDD?

What is the number that can be said as acceptable number of smaller indexes?

Also, I made further changes after looking here and there, will this increase my chances of getting data faster into Elasticsearch without overwhelming Elasticsearch and crashing either of them?

	input {
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log_pod1.sql"
		jdbc_paging_enabled => true
		jdbc_page_size => 1000
		jdbc_fetch_size => 1000
		clean_run => false
	}
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log_pod2.sql"
		jdbc_paging_enabled => true
		jdbc_page_size => 1000
		jdbc_fetch_size => 1000
		clean_run => false
	}
}
filter {
	 ruby {
		code => "event.set('updatetime_str', event.get('updatetime').time.localtime.strftime('%Y_%m_%d'))"
	}
}
output {
	elasticsearch {
	hosts => "<host:port>"
	index => "user_log_index_%{updatetime_str}"
	document_id => "%{user_log_key}" 
	document_type => "org_user_log"
	}
}

abhisheksimion · June 28, 2020, 6:20am

Open for accepting solutions on the latest Elasticsearch version 7.8 as well.

system · July 26, 2020, 6:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Logstash using multiple configuration Logstash	4	38	March 18, 2025
Help needed for logstash pipeline.yml settings Logstash	2	259	June 17, 2020
Logstash multi pipeline optimal configuration Logstash	1	234	February 9, 2021
Logstash 6.2.2 with JDBC data on not updated Logstash	6	921	May 4, 2018
Recommended performance settings Elasticsearch	2	713	July 1, 2017

Logstash pipline optimization

Related topics