Help Improving Performance

(TestCandidate) #1

So I am logging lots of requests coming into my server.
However, in order to identify some details about the user making each request, I match the request IP and request time against my RADIUS db and add the fetched results to the log.

I am using jdbc_streaming to fetch the results as follows:

filter {
  if "tcplog" in [tags] {

    ### PARSE CSV ###
    csv {
      # columns => [ ... ]   (column list elided)
      separator => ";"
      remove_field => ["remove1", "remove2"]
    }
    ### END CSV PARSING ###

    mutate { gsub => [ "reqdate", "EEST", "Europe/Moscow" ] }
    ### END TIMEZONE ###

    date {
      match  => [ "reqdate", "MMM dd, yyyy HH:mm:ss.SSSSSSSSS ZZZ" ]
      target => "reqdate"
    }

    ### GET MYSQL ###
    jdbc_streaming {
      jdbc_driver_library => "/etc/logstash/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
      jdbc_driver_class => "com.mysql.jdbc.Driver"
      jdbc_connection_string => "jdbc:mysql://localhost:3306/ispradius"
      jdbc_user => "user"
      jdbc_password => "password"
      statement => "SELECT User.Id as UserId, User.UserName, User.FirstName, User.LastName, User.Mobile, User.Phone, User.MailAddress, User.Address FROM ispradius.User WHERE User.Id = ( Select UserId From UserNasSession WHERE ( (StartSession <= :REQDATE AND EndSession >= :REQDATE ) OR (StartSession <= :REQDATE AND EndSession IS NULL) ) And IpAddress = :UIPADDR);"
      parameters => {
        "UIPADDR" => "userip"
        "REQDATE" => "reqdate"
      }
      target => "user_details"
    }
    ### END MYSQL ###
  }
}

This works well. Except it's too slow!

From my point of view, the bottleneck is MySQL. I think jdbc_streaming is failing to cache because my query contains the request timestamp, which pretty much never repeats.
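For reference, the jdbc_streaming filter's local cache is keyed on the resolved parameter values, so a near-unique timestamp parameter will essentially never hit it. The cache behaviour is controlled by these plugin options (values below are the documented defaults):

```
jdbc_streaming {
  # ... connection settings and statement as above ...
  use_cache        => true   # enable the local result cache (default)
  cache_expiration => 5.0    # seconds before a cached entry is considered stale
  cache_size       => 500    # maximum number of cached entries
}
```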

Is there any advice on how to improve this?

On another note: does jdbc_streaming open and close a connection for every log event? That could be losing time as well.

Thank you

(Christian Dahlqvist) #2

Although I have not looked at the code, I would be very surprised if persistent connections were not used. A lookup based on a query that cannot be cached will always be costly. One way to potentially make it a bit more efficient could be to load, and periodically refresh, closed sessions (EndSession not NULL) using the jdbc_static filter (I am assuming these do not change once closed), and only use the jdbc_streaming filter when no match was found for a closed session. Whether this makes any real difference naturally depends on how large a portion of your events relate to closed sessions.
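A minimal sketch of that approach, reusing the table and column names from the query above (the local table name, column types, and the 30-minute refresh schedule are assumptions to adapt):

```
filter {
  jdbc_static {
    # periodically bulk-load closed sessions into a local in-memory table
    loaders => [
      {
        id => "closed_sessions"
        query => "SELECT UserId, IpAddress, StartSession, EndSession FROM UserNasSession WHERE EndSession IS NOT NULL"
        local_table => "closed_sessions"
      }
    ]
    local_db_objects => [
      {
        name => "closed_sessions"
        index_columns => ["IpAddress"]
        columns => [
          ["UserId", "varchar(50)"],
          ["IpAddress", "varchar(45)"],
          ["StartSession", "varchar(64)"],
          ["EndSession", "varchar(64)"]
        ]
      }
    ]
    # per-event lookup runs against the local copy, not MySQL
    local_lookups => [
      {
        id => "closed_session_lookup"
        query => "SELECT UserId FROM closed_sessions WHERE IpAddress = :ip AND StartSession <= :ts AND EndSession >= :ts"
        parameters => { "ip" => "[userip]", "ts" => "[reqdate]" }
        target => "closed_session_user"
      }
    ]
    loader_schedule => "*/30 * * * *"   # refresh every 30 minutes (assumption)
    jdbc_driver_library => "/etc/logstash/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/ispradius"
    jdbc_user => "user"
    jdbc_password => "password"
  }
}
```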

(TestCandidate) #3

Thank you for taking the time to reply.

Unfortunately, the closed sessions are only used to index old data, which is a rare case. Requests coming in in real time are definitely using an open session.

And obviously, caching sessions that haven't ended would mess with the validity of my data.

The query itself to the db is acceptable, as everything is indexed and the db is local. However, I'm not sure why Logstash is slower than it should be (hence my question regarding whether jdbc_streaming maintains a persistent connection or opens/closes one on every request, because I assume that would be pretty costly).

(Christian Dahlqvist) #4

I suspected closed sessions would be rare. What throughput are you seeing? How many pipeline workers is Logstash using? How long are sessions for an IP address typically? How long until a new session is started for an IP once a session has finished?

(TestCandidate) #5

Logstash is on its default configuration (I'm not expert enough to mess around with those settings, but any suggestion there would be helpful).

As for the system itself, well, this is gonna be funny to explain (:
A session lifetime cannot be predicted. In my country, a lot of power outages occur on a daily basis; some users have battery backup power, some don't. Either way, when they disconnect and reconnect, their session is reset, not to mention other uncontrolled behavior like people restarting their routers, etc.

Once a session ends, a new session is automatically created when the user reconnects, so all of their new requests will fall within the new session. All the requests made before disconnecting are to be logged using the info of the previous session.

(Christian Dahlqvist) #6

If you knew there was a minimum delay between a session for an IP ending and the next one starting, you might be able to have one jdbc_static filter for closed sessions and one for open sessions, as long as you can find a refresh interval shorter than the time it takes for a new session to start. You could then fall back on the current jdbc_streaming filter only when there is no match. I suspect this will in the end be a tradeoff between accuracy and performance.
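The fallback can be sketched with a conditional around the existing filter; this assumes the jdbc_static lookups run first and write into a field such as [user_details] when they match:

```
filter {
  # jdbc_static lookups (closed and open sessions) would run first here,
  # filling [user_details] on a cache hit.

  if ![user_details] {
    # only hit the live database when the cached lookups missed
    jdbc_streaming {
      # ... same settings as the existing jdbc_streaming filter ...
      target => "user_details"
    }
  }
}
```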

As a lot of time is likely to be spent making requests to the database, you could perhaps increase throughput by increasing the number of pipeline workers (assuming your database can handle the concurrency).
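For example, the worker count can be raised in logstash.yml; 8 is an arbitrary illustrative value (the default is the number of CPU cores), and batch size can be tuned alongside it:

```
# logstash.yml
pipeline.workers: 8       # more filter workers -> more concurrent DB lookups
pipeline.batch.size: 125  # events per worker batch (default shown)
```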

If you install Logstash monitoring, or make use of the monitoring API, you may also get a better view of what is taking time in your pipeline.
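The node stats endpoint of the monitoring API (on Logstash's default API port 9600) reports per-plugin event counts and durations, which should show how much time the jdbc_streaming filter is consuming:

```
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
```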

(TestCandidate) #7

Thank you again for taking the time to answer.

I will try increasing the number of pipeline workers and see if that improves anything.

And I will definitely check the monitoring plugin to try to understand the bottleneck better.

I'll come back with results tomorrow.

(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.