Logstash pipeline optimization

I have multiple configurations defined in pipelines.yml. All of them push data from a JDBC source to Elasticsearch. I would like to know the optimal settings for this, because one of the configurations sometimes takes more than 12 hours to complete.

modified jvm.options

-Xms2g
-Xmx2g

logstash.yml

http.host: x.x.x.x
http.port: 15050
queue.type: persisted
queue.max_bytes: 8gb
queue.checkpoint.writes: 1
path.data: /home/app/logstash-6.2.4/data
path.settings: /home/app/logstash-6.2.4/config
config.reload.automatic: true
config.reload.interval: 5s

pipelines.yml

- pipeline.id: agent
  path.config: "../config/app_config/agent.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: user_dim
  path.config: "../config/app_config/user_dim.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: org
  path.config: "../config/app_config/org.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: task_dim
  path.config: "../config/app_config/task_dim.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: task_run_agg_incr
  path.config: "../config/app_config/task_run_agg_incr.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 500
- pipeline.id: unique_endpoint
  path.config: "../config/app_config/unique_endpoint.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000
- pipeline.id: user_log
  path.config: "../config/app_config/user_log.conf"
  pipeline.workers: 3
  queue.type: persisted
  pipeline.batch.size: 1000

The following pipelines take 5-10 minutes to fully push their data to Elasticsearch: agent, user_dim, org, unique_endpoint.

The problem is with the user_log configuration, which takes more than 12 hours to complete. Any pointers on improving the data load time?

Here is the configuration; its SQL query returns 29,230,868 records.

Configuration

input {
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log.sql"
		}
}

output {
	elasticsearch {
		hosts => "<host:port>"
		index => "user_log_index_%{+YYYY_MM_dd}"
		document_id => "%{user_log_key}"
		document_type => "org_user_log"
	}
}

Box info: 2 CPU cores, 8GB RAM, 70GB HDD

One reason why this could be happening is that you are setting document_id: each record inserted into Elasticsearch then requires a lookup in your index user_log_index_%{+YYYY_MM_dd} to check whether a document with that id already exists, so as the index grows, this check before indexing takes longer.

One way to avoid this problem is to partition your index into smaller pieces. Right now you are likely indexing everything into one index (assuming YYYY_MM_dd does not change during those 12 hours of indexing). You could instead use a different indexing strategy and write to multiple smaller indexes, so that the id lookup stays fast. While doing so, keep in mind:

  1. Your access patterns (how you are going to query the data).
  2. That you do not end up creating a large number of very small indexes.
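As a sketch of such a strategy (assuming the rows carry their own event timestamp, which Logstash maps into @timestamp upstream, e.g. with a date or ruby filter), switching from a daily pattern based on ingestion time to a coarser, event-time-based pattern keeps each index bounded:

```
output {
	elasticsearch {
		hosts => "<host:port>"
		# one index per month of *event* time keeps each index's
		# document count, and hence the per-id lookup cost, bounded
		index => "user_log_index_%{+YYYY_MM}"
		document_id => "%{user_log_key}"
		document_type => "org_user_log"
	}
}
```

Note that `%{+YYYY_MM}` formats the event's @timestamp field, so without a filter setting @timestamp from the row's own date column, all events indexed in one run still land in the same index.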

The other way is to remove the document_id setting from the output, if you don't mind duplicate entries in Elasticsearch.

To answer your question: I am querying this data through an alias, and I want to avoid duplicates, which is why I have to use document_id.
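For reference, the alias setup looks roughly like this (the alias name and index pattern here are illustrative, not taken from the actual setup):

```
POST /_aliases
{
  "actions": [
    { "add": { "index": "user_log_index_*", "alias": "user_log" } }
  ]
}
```

One caveat when combining this with partitioned indexes: document_id deduplicates only within a single index, so the same user_log_key landing in two different date-partitioned indexes would appear twice when queried through the alias.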

A little info on the data: it is the audit log for one of our products, which is why it is huge. I am thinking of partitioning the retrieval by splitting it into multiple queries, as given below.

The production system will have a dedicated machine for Logstash with 32GB RAM, an 8-core CPU, and a minimum of 300GB HDD, but is this kind of load still feasible with 2 CPU cores, 8GB RAM, and a 70GB HDD?

And what would be considered an acceptable number of smaller indexes?

Also, I made further changes after some research. Will these improve my chances of loading the data faster, without overwhelming Elasticsearch and crashing either system?

input {
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log_pod1.sql"
		jdbc_paging_enabled => true
		jdbc_page_size => 1000
		jdbc_fetch_size => 1000
		clean_run => false
	}
	jdbc {
		jdbc_connection_string => "jdbc:oracle:thin:@<host:port>/sname"
		jdbc_user => "xx"
		jdbc_password => "xxxx"
		jdbc_validate_connection => true
		jdbc_driver_library => "/home/app_config/logstash-6.2.4/jdbc_drivers/OJDBC-Full/ojdbc7.jar"
		jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
		schedule => "53 11 * * *"
		statement_filepath =>"/home/app_config/logstash-6.2.4/bin/queries/user_log_pod2.sql"
		jdbc_paging_enabled => true
		jdbc_page_size => 1000
		jdbc_fetch_size => 1000
		clean_run => false
	}
}
filter {
	 ruby {
		code => "event.set('updatetime_str', event.get('updatetime').time.localtime.strftime('%Y_%m_%d'))"
	}
}
output {
	elasticsearch {
		hosts => "<host:port>"
		index => "user_log_index_%{updatetime_str}"
		document_id => "%{user_log_key}"
		document_type => "org_user_log"
	}
}
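The contents of the pod1/pod2 SQL files are not shown above, but one way such a split could work (table and column names here are assumptions for illustration) is to give each input a disjoint slice of the key space, for example with a modulus, so the two inputs never produce the same row:

```
input {
	jdbc {
		# ... same connection settings as above ...
		# hypothetical pod1 slice: even keys
		statement => "SELECT * FROM user_log WHERE MOD(user_log_key, 2) = 0"
	}
	jdbc {
		# ... same connection settings as above ...
		# hypothetical pod2 slice: odd keys
		statement => "SELECT * FROM user_log WHERE MOD(user_log_key, 2) = 1"
	}
}
```

Because the slices are disjoint, the inputs can run on the same schedule in parallel without producing duplicate document_ids.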

Open to solutions for the latest Elasticsearch version, 7.8, as well.
