Running Logstash 7.1.1 on Amazon Linux 2; jdbc_static plugin version 1.0.6 (stock installation for this version of Logstash)
Sending data to Elasticsearch 7.1.1 (hosted on AWS Elasticsearch service)
Logstash is running on OpenJDK 11 (openjdk version "11.0.5" 2019-10-15 LTS)
I have a stream of logging events that arrive at a nearly constant rate of 15000 per second.
I am attempting to enrich each event by adding two fields, looked up from two separate database tables (one column per table).
I have two Logstash instances (8 vCPUs, 64 GB RAM each) that share the load evenly; CPU consumption is about 15 percent on each vCPU in steady state.
These instances are running on AWS EC2 with EBS storage; I am using memory queues (no persistent queues at all).
Each Logstash instance is running 32 workers, with a pipeline batch size of 5120 and a pipeline batch delay of 50 ms.
'top -H' shows that the worker threads are getting an equal share of the available CPU.
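For reference, the relevant settings look roughly like this in logstash.yml on each instance (only the settings described above; everything else is left at its default):
pipeline.workers: 32
pipeline.batch.size: 5120
pipeline.batch.delay: 50   # milliseconds
queue.type: memory         # no persistent queues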
The pipeline looks like:
Input: Kinesis
Filters: date -> (jdbc_static) -> fingerprint (MD5)
Output: amazon_es
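In skeletal form the .conf looks like this (stream name, region, field names, and index pattern are placeholders; the jdbc_static block is shown in full further down):
input {
  kinesis {
    kinesis_stream_name => "my-stream"
    region => "us-east-1"
  }
}
filter {
  date {
    match => ["timestamp", "ISO8601"]
  }
  # jdbc_static { ... }   # the enrichment filter in question
  fingerprint {
    method => "MD5"
    source => "message"
    target => "[@metadata][fingerprint]"
  }
}
output {
  amazon_es {
    hosts => ["my-domain.us-east-1.es.amazonaws.com"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}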
These two Logstash instances have no problem keeping up until I add the jdbc_static filter into the filter part of the pipeline.
When I add the jdbc_static filter to the pipeline on one of the two instances, throughput on that instance immediately drops to near zero; oddly, after the first minute only 9999 events have gone through the pipeline in aggregate, and no further progress is made.
The second Logstash instance remains healthy and processes the events at the expected rate (about 460000 per minute).
When I swap the old .conf file back in, the pipeline stalls and logs this message:
[2020-01-08T18:19:17,606][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::JdbcStatic", ...
... so I stop and start Logstash, which allows the instance to return to processing events at the rate of about 460000 per minute.
I have another log stream handled by a single Logstash instance, configured with the same pipeline, that supports a much smaller volume of events (150 per second) and runs fine for days on end.
Here is the segment of the pipeline conf file (anonymized, of course) which defines the jdbc_static filter:
jdbc_static {
  loaders => [
    {
      id => "sql-table1"
      query => "SELECT rsid, rsname FROM table1 WHERE IsDeleted = 0"
      local_table => "local-table1"
    },
    {
      id => "sql-table2"
      query => "SELECT ssid, ssname FROM table2 WHERE IsDeleted = 0"
      local_table => "local-table2"
    }
  ]
  local_db_objects => [
    {
      name => "local-table1"
      index_columns => ["rsid"]
      columns => [
        ["rsid", "varchar(36)"],
        ["rsname", "varchar(50)"]
      ]
    },
    {
      name => "local-table2"
      index_columns => ["ssid"]
      columns => [
        ["ssid", "varchar(36)"],
        ["ssname", "varchar(50)"]
      ]
    }
  ]
  local_lookups => [
    {
      id => "local1"
      query => "select rsname from local-table1 WHERE rsid = :rsid"
      parameters => { rsid => "[fields][rsid]" }
      target => "[@metadata][f1]"
    },
    {
      id => "local2"
      query => "select ssname from local-table2 WHERE ssid = :ssid"
      parameters => { ssid => "[fields][ssid]" }
      target => "[@metadata][f2]"
    }
  ]
  add_field => { "[fields][field1]" => "%{[@metadata][f1][0][rsname]}" }
  add_field => { "[fields][field2]" => "%{[@metadata][f2][0][ssname]}" }
  staging_directory => "/tmp/logstash/jdbc_static/import_data"
  loader_schedule => "* */12 * * *" # run loaders every 12 hours
  jdbc_user => "foo"
  jdbc_password => "bar"
  jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  jdbc_driver_library => "/etc/logstash/mssql-jdbc-7.4.1.jre11.jar"
  jdbc_connection_string => "jdbc:sqlserver://baz.com:1433;databaseName=quux"
}
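(For clarity: each local lookup writes its matching rows into the target field as an array of hashes, which is why the add_field templates index into element [0] to pull out the single expected match.)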
Here are the relevant lines from the jdbc_static initialization as they appear in the Logstash logs (table names anonymized):
[2019-12-30T17:23:24,674][INFO ][logstash.filters.jdbcstatic] derby.system.home is: /usr/share/logstash
[2019-12-30T17:23:28,760][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, fetched 1169 records in: 0.521 seconds
[2019-12-30T17:23:28,878][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, saved fetched records to import file in: 0.114 seconds
[2019-12-30T17:23:29,281][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, imported all fetched records in: 0.399 seconds
[2019-12-30T17:23:29,413][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, fetched 2523 records in: 0.13 seconds
[2019-12-30T17:23:29,575][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, saved fetched records to import file in: 0.162 seconds
[2019-12-30T17:23:29,643][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, imported all fetched records in: 0.067 seconds
[2019-12-30T17:23:29,658][INFO ][logstash.filters.jdbcstatic] Scheduler operations: Scheduled for: these hours in the day [0, 12];these seconds in the minute [0]
[2019-12-30T17:23:29,660][INFO ][logstash.filters.jdbcstatic] Scheduler scan for work frequency is: 2.5
As you can see, the two tables are not large (1169 and 2523 rows), and the refresh schedule is every 12 hours (twice per day). Based on the documentation, these table sizes seem quite reasonable.
These are exactly the same tables used on my low-volume instance, and the fetched row counts match the results I get from the same queries issued in a SQL browser.
Earlier, I had discovered that the jdbc_static plugin was generating at least one warning message for each document that was missing either the "rsid" or "ssid" field, which produced a lot of useless logging. Because our log events are not uniform, not every event contains a field that can be looked up in the database, so the overhead of writing those warning messages was my first suspect.
There were many messages sent to the log file of this form (again, anonymized):
[2019-12-30T21:19:38,304][WARN ][logstash.filters.jdbc.lookup] Parameter field not found in event {:lookup_id=>"local-table1", :invalid_parameters=>["[fields][rsid]"]}
[2019-12-30T21:19:38,304][WARN ][logstash.filters.jdbc.lookup] Parameter field not found in event {:lookup_id=>"local-table2", :invalid_parameters=>["[fields][ssid]"]}
I suppressed that noise by adding the following to Logstash's log4j2 configuration (log4j2.properties):
logger.jdbclookup.name = logstash.filters.jdbc.lookup
logger.jdbclookup.level = error
No other error or warning messages are reported in the logstash-plain.log file, and there are no messages in the slowlog.
There doesn't seem to be any additional Java GC activity with the jdbc_static filter in place, over and above what the simpler pipeline shows.
Questions:
- What units is the "Scheduler scan for work frequency" value expressed in?
- What does the "scheduler scan" pertain to?
- Am I missing some obvious mis-configuration of this pipeline?