Jdbc input sql_last_value from another source?

I've been struggling with this for a couple of days. I'm using the jdbc input plugin, but I need to source sql_last_value from a query against a different DB rather than having the plugin pull it from last_run_metadata_path. So:

  1. Fetch the timestamp I need for 'sql_last_value'
  2. Run the JDBC query to fetch the data, plugging in the value I just fetched
  3. Spew it all to Elasticsearch

If it were just steps 2 and 3, it would be easy! Sadly, using last_run_metadata_path doesn't work for me because we're running Logstash in a container cluster and there's no guarantee how long any given instance will live. Once it dies and a new one is spun up, we lose that last_run_metadata_path file.

Hence, I'm trying to source its equivalent from a different location (a DB in this case).

Any suggestions on how to accomplish that?

Conceivably you could use a jdbc input to poll for the sql_last_value, then a jdbc_streaming filter to fetch the data, then a jdbc output to update sql_last_value. Or possibly a heartbeat input to schedule things and three jdbc_streaming filters (fetch sql_last_value, fetch data, update sql_last_value). Not sure if you can do an update in a jdbc_streaming filter.

You would have to make sure jdbc_streaming does not reuse a cached value for sql_last_value, and you probably do not want any batching in the pipeline.
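For what it's worth, a rough and untested sketch of that idea. The driver, connection strings, the jdbc_watermark tracking table, the events table, and the elasticsearch hosts are all placeholders:

input {
  jdbc {
    # poll the tracking DB for the stored watermark instead of using last_run_metadata_path
    jdbc_connection_string => "jdbc:postgresql://tracker-db:5432/meta"   # placeholder
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    # jdbc_driver_library => "/path/to/postgresql.jar"
    schedule => "*/5 * * * *"
    statement => "SELECT last_ts FROM jdbc_watermark WHERE job = 'my_job'"
  }
}
filter {
  jdbc_streaming {
    # fetch the actual data, parameterised with the watermark carried on the event
    jdbc_connection_string => "jdbc:postgresql://data-db:5432/app"       # placeholder
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * FROM events WHERE updated_at > :last_ts"
    parameters => { "last_ts" => "last_ts" }
    use_cache => false      # never reuse a cached result for the watermark lookup
    target => "rows"
  }
  split { field => "rows" } # one event per row
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  # writing the new watermark back would need something like the community
  # jdbc output plugin, or a separate job; I have not tried that part
}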

I thought that might be the case. I'm not sure why, but jdbc_streaming blows up my heap. As an input, jdbc works fine; if I use jdbc_streaming, the heap dies. I suspect it's because it copies a large JDBC result set into the target field. And then I have to do a split on that field so that each entry in the array of JDBC results becomes its own event for the output. I think split also makes copies.

Conceptually it seems like a small problem: fetch or set a variable value from somewhere before running the pipeline. In practice that turns out to be very difficult :frowning:

Thankfully, RobBavey fixed that bug in PR 40. There was a typo in the filter: if you tried to split an event with a very large array, each of the split events would be created with a copy of the complete array, which was then immediately overwritten with a single entry from it. The GC rates for large arrays were ridiculous, resulting in terrible performance.

Oh! Is there a version out that has that fix? I'm not sure how to tell if I have it. Would that be in a new version of Logstash, or some specific pull of the split plugin that I'd need to do?

I'm certainly happy to give it a shot.

The fix was merged into the main line two years ago. It is not recent.

Damn, then it won't help me; I just downloaded Logstash a couple of weeks ago, so I should already have that fix. Something about using jdbc_streaming/split still causes heap issues.

This works:

input { jdbc }
filter { some field rename/copy }
output { elasticsearch }
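Fleshed out, that working shape is roughly the following; the connection string, column names, schedule, and index are placeholders for the real ones:

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://data-db:5432/app"   # placeholder
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    # jdbc_driver_library => "/path/to/postgresql.jar"
    schedule => "*/5 * * * *"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/.my_jdbc_last_run"
    statement => "SELECT * FROM events WHERE updated_at > :sql_last_value LIMIT 100000"
  }
}
filter {
  mutate { rename => { "updated_at" => "event_time" } }   # the field rename/copy
}
output {
  elasticsearch { hosts => ["localhost:9200"] index => "my-events" }
}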

Sadly, this blows up the heap, even though it should conceptually do the same thing:

input { http { get a timestamp } }
filter { jdbc_streaming, field rename/copy }
output { elasticsearch }
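Fleshed out, the failing shape is roughly the following, sketched here with http_poller doing the timestamp fetch; the endpoint, connection string, and names are placeholders:

input {
  http_poller {
    # poll a hypothetical internal endpoint that returns { "last_ts": "..." }
    urls => { watermark => "http://internal-api/last-run-timestamp" }
    schedule => { "every" => "5m" }
    codec => "json"
  }
}
filter {
  jdbc_streaming {
    jdbc_connection_string => "jdbc:postgresql://data-db:5432/app"   # placeholder
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    # the whole 100,000-row result set lands in one event under [rows]...
    statement => "SELECT * FROM events WHERE updated_at > :last_ts LIMIT 100000"
    parameters => { "last_ts" => "last_ts" }
    use_cache => false
    target => "rows"
  }
  # ...and split then fans it out into one event per row
  split { field => "rows" }
  mutate { rename => { "[rows][updated_at]" => "event_time" } }   # the field rename/copy
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}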

Coming at this from another angle, could I use pipeline-to-pipeline communication to enforce a sequence of two pipelines? Combined with using multiple inputs/outputs, would something like this work?

- pipeline.id: upstream
  config.string: |
    input { http { fetch-timestamp } }
    output { 
        pipeline { send_to => [my_downstream] }
        file { update the last_run_metadata_path myself with the timestamp }
    }
- pipeline.id: downstream
  config.string: |
    input { 
        pipeline { address => my_downstream} 
        jdbc { 
          uses the sql_last_value I want because the upstream pipeline wrote it
          type => "my_jdbc" }
    }
    filter { throw out if type != "my_jdbc" so I only see the jdbc events, not the upstream pipeline event }
    output { elasticsearch }

This seems really hacky, but would it accomplish what I'm trying to do, which is to determine :sql_last_value myself before each pipeline run?

I don't think so. The input cannot reference the fields of an event, and order is not guaranteed.

Are you using paging in the jdbc input? I am wondering if the result set for the query is very large. The input would fetch a subset of the result set and flush it into the pipeline in batches, whereas the filter would fetch the whole thing in a single event.
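In case it is not enabled, paging in the jdbc input looks something like this; the page and fetch sizes are just example values:

input {
  jdbc {
    # ...connection settings...
    jdbc_paging_enabled => true
    jdbc_page_size => 25000   # the input wraps the statement in LIMIT/OFFSET pages of this size
    jdbc_fetch_size => 1000   # rows fetched per round trip by the driver
    statement => "SELECT * FROM events WHERE updated_at > :sql_last_value"
  }
}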

Yeah, it's a SQL query with a LIMIT 100000 on it, so it's a big result set. For business reasons I can't make it any smaller than that 100,000 limit. I know that sounds silly, but trust me, I've spent days on that already.

I'm going to explore using AWS EFS, which would give me a persistent file system where the Logstash container can write the last_run_metadata_path file. That will survive the ECS container being destroyed and redeployed.
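If that pans out, the only pipeline change should be pointing last_run_metadata_path at the EFS mount, something like this (the mount path is hypothetical):

input {
  jdbc {
    # ...connection settings...
    # /mnt/efs is a hypothetical EFS mount point attached to the ECS task
    last_run_metadata_path => "/mnt/efs/logstash/.my_job_last_run"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    statement => "SELECT * FROM events WHERE updated_at > :sql_last_value LIMIT 100000"
  }
}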

BTW, thanks for all your help. Having someone willing to be responsive and offer advice is a huge boost for my morale, even if I can't quite do what I was trying to do :slight_smile:
