Logstash jdbc_static filter plugin performing very poorly (nearly not at all)

Running Logstash 7.1.1 on Amazon Linux 2; jdbc_static plugin version 1.0.6 (stock installation for this version of Logstash)
Sending data to Elasticsearch 7.1.1 (hosted on AWS Elasticsearch service)
Logstash is running on OpenJDK 11 (openjdk version "11.0.5" 2019-10-15 LTS)

I have a stream of logging events that arrive at a nearly constant rate of 15000 per second.
I am attempting to enrich each event with two added fields, looked up from two separate database tables (one column per table).
I have two Logstash instances (8 vCPUs, 64 GB RAM each) that share the load evenly; in steady state, CPU utilization is about 15 percent on each vCPU.

These instances are running on AWS EC2 with EBS storage; I am using memory queues (no persistent queues at all).

Each Logstash instance runs 32 pipeline workers with a batch size of 5120 and a batch delay of 50 ms.
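
For reference, those settings live in logstash.yml; this is just a sketch echoing the values described above:

pipeline.workers: 32
pipeline.batch.size: 5120
pipeline.batch.delay: 50   # milliseconds
queue.type: memory         # in-memory queue, no persistence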

'top -H' shows that the worker threads are getting an equal share of the available CPU.

The pipeline looks like this (sketched in configuration form below):

Input: Kinesis
Filters: date -> (jdbc_static) -> fingerprint (MD5)
Output: amazon_es
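
In configuration terms, that outline corresponds roughly to the following (a sketch only; plugin options are elided, and the jdbc_static block is shown in full further down):

input {
  kinesis {
    # stream, region, etc. elided
  }
}

filter {
  date {
    # parse the event timestamp
  }
  jdbc_static {
    # full configuration shown further down
  }
  fingerprint {
    method => "MD5"
  }
}

output {
  amazon_es {
    # AWS Elasticsearch endpoint elided
  }
}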

The two Logstash instances have no problem keeping up until I add the jdbc_static filter to the pipeline.

When I add the JDBC filter to the pipeline on one of the two instances, throughput on that instance immediately drops to near zero; oddly, after the first minute, only 9999 events go through the pipeline in aggregate, and then no more progress is made.

The second Logstash instance remains healthy and processes the events at the expected rate (about 460000 per minute).

When I swap the old .conf file (without the jdbc_static filter) back in, the pipeline stalls, logging this message:

[2020-01-08T18:19:17,606][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::JdbcStatic", ...

... so I stop and start Logstash, which allows the instance to return to processing events at about 460000 per minute.

I have another log stream using a single Logstash instance, configured with the same pipeline, supporting a much smaller volume of events (150 per second) which runs fine for days on end.

Here is the segment of the pipeline conf file (anonymized, of course) which defines the jdbc_static filter:

jdbc_static {
  loaders => [
    {
      id => "sql-table1"
      query => "SELECT rsid, rsname FROM table1 WHERE IsDeleted = 0"
      local_table => "local-table1"
    },
    {
      id => "sql-table2"
      query => "SELECT ssid, ssname FROM table2 WHERE IsDeleted = 0"
      local_table => "local-table2"
    }
  ]
  local_db_objects => [
    {
      name => "local-table1"
      index_columns => ["rsid"]
      columns => [
        ["rsid", "varchar(36)"],
        ["rsname", "varchar(50)"]
      ]
    },
    {
      name => "local-table2"
      index_columns => ["ssid"]
      columns => [
        ["ssid", "varchar(36)"],
        ["ssname", "varchar(50)"]
      ]
    }
  ]
  local_lookups => [
    {
      id => "local1"
      query => "select rsname from local-table1 WHERE rsid = :rsid"
      parameters => {rsid => "[fields][rsid]"}
      target => "[@metadata][f1]"
    },
    {
      id => "local2"
      query => "select ssname from local-table2 WHERE ssid = :ssid"
      parameters => {ssid => "[fields][ssid]"}
      target => "[@metadata][f2]"
    }
  ]

  add_field => { "[fields][field1]" => "%{[@metadata][f1][0][rsname]}" }
  add_field => { "[fields][field2]" => "%{[@metadata][f2][0][ssname]}" }

  staging_directory => "/tmp/logstash/jdbc_static/import_data"
  loader_schedule => "* */12 * * *" # run loaders every 12 hours

  jdbc_user => "foo"
  jdbc_password => "bar"
  jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  jdbc_driver_library => "/etc/logstash/mssql-jdbc-7.4.1.jre11.jar"
  jdbc_connection_string => "jdbc:sqlserver://baz.com:1433;databaseName=quux"
}

Here are the relevant lines from the jdbc_static initialization as they appear in the Logstash logs (table names anonymized):

[2019-12-30T17:23:24,674][INFO ][logstash.filters.jdbcstatic] derby.system.home is: /usr/share/logstash
[2019-12-30T17:23:28,760][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, fetched 1169 records in: 0.521 seconds
[2019-12-30T17:23:28,878][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, saved fetched records to import file in: 0.114 seconds
[2019-12-30T17:23:29,281][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table1, imported all fetched records in: 0.399 seconds
[2019-12-30T17:23:29,413][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, fetched 2523 records in: 0.13 seconds
[2019-12-30T17:23:29,575][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, saved fetched records to import file in: 0.162 seconds
[2019-12-30T17:23:29,643][INFO ][logstash.filters.jdbc.readwritedatabase] loader sql-table2, imported all fetched records in: 0.067 seconds
[2019-12-30T17:23:29,658][INFO ][logstash.filters.jdbcstatic] Scheduler operations: Scheduled for: these hours in the day [0, 12];these seconds in the minute [0]
[2019-12-30T17:23:29,660][INFO ][logstash.filters.jdbcstatic] Scheduler scan for work frequency is: 2.5

As you can see, the two tables are not large, and the refresh schedule is every 12 hours (twice per day). Based on the documentation, these table sizes seem quite reasonable.

These tables are exactly the same ones used on my low-volume instance, and they match the results I get from a SQL query issued from a SQL browser.

Earlier, I had discovered that the jdbc_static plugin was generating at least one warning message for each document that was missing either the "rsid" or "ssid" field; this was contributing to a lot of useless logging.

Our log events are not uniform, so not every event contains a field that would be looked up in the database; I initially thought the overhead of writing all of those warning messages was the culprit.

There were many messages sent to the log file of this form (again, anonymized):

[2019-12-30T21:19:38,304][WARN ][logstash.filters.jdbc.lookup] Parameter field not found in event {:lookup_id=>"local-table1", :invalid_parameters=>["[fields][rsid]"]}
[2019-12-30T21:19:38,304][WARN ][logstash.filters.jdbc.lookup] Parameter field not found in event {:lookup_id=>"local-table2", :invalid_parameters=>["[fields][ssid]"]}

I suppressed that noise in the Logstash log4j2.properties configuration using:

logger.jdbclookup.name = logstash.filters.jdbc.lookup
logger.jdbclookup.level = error

No other error or warning messages are reported in the logstash-plain.log file, and there are no messages in the slowlog.

There does not seem to be any additional Java GC activity associated with running the jdbc_static filter, compared with the simpler pipeline.

Questions:

  1. What units is the "Scheduler scan for work frequency" expressed in?
  2. What does the "scheduler scan" pertain to?
  3. Am I missing some obvious mis-configuration of this pipeline?

Your local lookups are not querying local tables. Is that correct or a problem with the way you disguised things?

I messed up. I will correct the mistake I made when disguising the original names.

OK, I believe that I've corrected the mistake.

In answer to your questions:

  1. Seconds. If the schedule specifies only seconds, the scheduler checks it every 0.3 seconds; if it has other components (yours has hours), it checks every 2.5 seconds. You can see the code here.

  2. I believe the "scan" is when it checks whether it is time to run the loaders (noon or midnight, in your case).

  3. Not that I can see

I would get a thread dump and see if it is possible to work out what the stalled thread is doing. If you post the thread dump somewhere and link to it I don't mind taking a look.

Thanks. I have been looking at the code, but I'm not very conversant in Ruby.

I guess I probably don't care about the schedule checking, anyway. I was mostly just annoyed that I had to guess about the units.

I was hypothesizing that there was contention for the log file, due to the zillions of warning messages, but the Ruby code seems to guard calls to the logger:

logger.warn? && logger.warn("Parameter field not found in event", :lookup_id => @id, :invalid_parameters => result.invalid_parameters)

so that doesn't seem like a likely explanation because the logger never sees things that are suppressed by my choice of the "error" level.

I'll dig deeper into the Java threads to see what I can find.

As a result of dealing with this issue, I can also say I'm not a fan of the idea of warning on every occurrence of a "missing" field in every document. How many logging sources are so homogeneous that each document always has every field? It's sort of an anti-use-case for Elasticsearch, I think.

Oh, by the way, my loader schedule is wrong: it runs the loaders once for each minute of the 0th and 12th hours, so I need to repair that.
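
The corrected schedule, assuming the intent is to run the loaders at the top of hours 0 and 12, would be:

loader_schedule => "0 */12 * * *"   # minute 0 of hours 0 and 12, i.e. twice a day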

OK, I discovered that one of my two SQL statements for loading from the remote DB was malformed; I thought that might be the cause of my troubles.

I repaired that issue, as well as my bad loader schedule, and still the behavior persists.

The following is not exactly what you asked for, and I will also look at Java-level thread locks, but perhaps this might help since it provides Ruby context. This shows the expected number of workers and where they are waiting.

When I attempt to restart or stop Logstash, the following is logged (the contents of the jdbc_static configuration are omitted):

[2020-01-09T16:29:41,191][ERROR][org.logstash.execution.ShutdownWatcherExt] The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.
[2020-01-09T16:45:41,360][WARN ][org.logstash.execution.ShutdownWatcherExt] {
"inflight_count"=>0, 
"stalling_threads_info"=>{["LogStash::Filters::JdbcStatic",
{"loaders"=> ... ENTIRE CONFIGURATION OF JDBC_STATIC REPLACED HERE }]
=> [
{"thread_id"=>40, "name"=>"[logs]>worker0", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>41, "name"=>"[logs]>worker1", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>42, "name"=>"[logs]>worker2", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"},
{"thread_id"=>43, "name"=>"[logs]>worker3", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"},
{"thread_id"=>44, "name"=>"[logs]>worker4", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/adapters/jdbc.rb:250:in `block in execute'"},
{"thread_id"=>45, "name"=>"[logs]>worker5", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>46, "name"=>"[logs]>worker6", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"},
{"thread_id"=>47, "name"=>"[logs]>worker7", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/adapters/jdbc.rb:250:in `block in execute'"},
{"thread_id"=>48, "name"=>"[logs]>worker8", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>49, "name"=>"[logs]>worker9", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>50, "name"=>"[logs]>worker10", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"},
{"thread_id"=>51, "name"=>"[logs]>worker11", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>52, "name"=>"[logs]>worker12", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>53, "name"=>"[logs]>worker13", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>54, "name"=>"[logs]>worker14", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>55, "name"=>"[logs]>worker15", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>56, "name"=>"[logs]>worker16", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:106:in `block in hold'"},
{"thread_id"=>57, "name"=>"[logs]>worker17", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>58, "name"=>"[logs]>worker18", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>59, "name"=>"[logs]>worker19", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>60, "name"=>"[logs]>worker20", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>61, "name"=>"[logs]>worker21", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>62, "name"=>"[logs]>worker22", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>63, "name"=>"[logs]>worker23", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>64, "name"=>"[logs]>worker24", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>65, "name"=>"[logs]>worker25", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"},
{"thread_id"=>66, "name"=>"[logs]>worker26", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>67, "name"=>"[logs]>worker27", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>68, "name"=>"[logs]>worker28", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>69, "name"=>"[logs]>worker29", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>70, "name"=>"[logs]>worker30", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:147:in `block in wait'"},
{"thread_id"=>71, "name"=>"[logs]>worker31", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/sequel-5.20.0/lib/sequel/connection_pool/threaded.rb:161:in `block in wait'"}]}}

I also looked at all of the open issues on GitHub and found issue #27; the discussion there made me suspect that I am facing the same problem described in it.

Regardless, my concern about the design decision to warn on every document that has at least one "missing column" remains. I'm considering opening an issue on GitHub on that topic.

Those threads are waiting to obtain a connection to the database. It could indicate a connection leak, a loss of connectivity to the database, or some connections getting wedged. I do not have a database with which I could test how to diagnose that.

I cannot be sure, but I don't think you are hitting issue #27.

When you say "the database", how can I tell whether it's waiting for the remote database (which I hit every 12 hours) or the local Derby database?

Once the plugin starts and the two loader queries complete, netstat tells me that the remote SQL connection is neither open nor waiting.

Based on this fellow's comment, and the fact that my source tables are 1169 and 2523 rows respectively, I'm going to try the jdbc input -> file -> translate filter approach that he reports gave him good performance (sketched below).

Also, if that happens to yield decent performance as well as getting rid of the billions of "Parameter field not found ..." messages, it would be an all-out victory.
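
Roughly, what I have in mind is a small side pipeline that refreshes a dictionary file from the remote database, with the main pipeline doing its lookups via the translate filter. This is only a sketch for the first table (the second would be handled the same way), and the dictionary path is a placeholder:

# Side pipeline: refresh the dictionary file twice a day
input {
  jdbc {
    jdbc_driver_library => "/etc/logstash/mssql-jdbc-7.4.1.jre11.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://baz.com:1433;databaseName=quux"
    jdbc_user => "foo"
    jdbc_password => "bar"
    statement => "SELECT rsid, rsname FROM table1 WHERE IsDeleted = 0"
    schedule => "0 */12 * * *"
  }
}
output {
  csv {
    path => "/etc/logstash/dictionaries/table1.csv"   # placeholder path
    fields => ["rsid", "rsname"]
    # note: this output appends on each run, so the file would need to be
    # truncated or rotated between refreshes
  }
}

# Main pipeline: look up rsname via the dictionary file
filter {
  translate {
    field => "[fields][rsid]"
    destination => "[fields][field1]"
    dictionary_path => "/etc/logstash/dictionaries/table1.csv"
    refresh_interval => 43200   # re-read the dictionary every 12 hours (in seconds)
  }
}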

I haven't seen the rest of the stack trace, so I do not know :slight_smile:

To clarify, that is the entire stack trace that results from the Logstash shutdown attempting to empty and close the pipeline. It is not a "steady state" stack trace.

Sorry for the delay; I have a thread dump produced by 'jcmd <pid> Thread.print'.

It is 15251 lines, so how should I get it to you for inspection?

How about pastebin.com?

It's 1.5 megabytes uncompressed (pastebin rejects that) and 32481 bytes compressed (but I can only upload images here).

Do you have another suggestion?

dropbox.com? You can upload a zip file there.

See if this link works:

Windows does not think that is a valid zip file (nor does dropbox).