I've been struggling for a couple of days on this. I'm using the jdbc input plugin, but I need to source the sql_last_value from a query against another DB rather than having it pull from last_run_metadata_path. So:
1. Fetch the timestamp I need for sql_last_value
2. Run the JDBC query to fetch the data, plugging in that value I just fetched
3. Spew it all to Elasticsearch
If it's just steps 2 and 3, it's easy! Sadly, using the last_run_metadata_path doesn't work for me because we're running logstash in a container cluster and there's no guarantee how long that instance will live. Once it dies and a new one is spun up, we lose that last_run_metadata_path.
Hence, I'm trying to source its equivalent from a different location (a DB in this case).
Conceivably you could use a jdbc input to poll for the sql_last_value, then use a jdbc_streaming filter to fetch the data, then a jdbc output to update sql_last_value. Or possibly a heartbeat input to schedule things and three jdbc_streaming filters (fetch sql_last_value, fetch data, update sql_last_value). Not sure if you can do an update in a jdbc_streaming filter.
You would have to make sure jdbc_streaming does not reuse a cached value for sql_last_value, and you probably do not want any batching in the pipeline.
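Something like this, very roughly, for the first approach. The connection strings, table, and column names are all made up and I have not tested it, so treat it as a sketch rather than a working config:

    input {
      # poll a tracking table for the last processed timestamp
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://tracking-host/tracking"   # hypothetical
        jdbc_user => "user"
        jdbc_password => "password"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT last_ts FROM etl_state WHERE job = 'my_job'"
        schedule => "* * * * *"
      }
    }
    filter {
      # fetch the actual data, plugging in the timestamp from the event
      jdbc_streaming {
        jdbc_connection_string => "jdbc:postgresql://data-host/data"
        jdbc_user => "user"
        jdbc_password => "password"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * FROM mytable WHERE updated_at > :last_ts"
        parameters => { "last_ts" => "last_ts" }
        target => "rows"
        use_cache => false   # do not reuse a cached result for sql_last_value
      }
      # one event per row
      split { field => "rows" }
    }
    output {
      elasticsearch { hosts => ["http://localhost:9200"] }
      # write the new high-water mark back to the tracking table
      jdbc {
        connection_string => "jdbc:postgresql://tracking-host/tracking"
        statement => [ "UPDATE etl_state SET last_ts = ? WHERE job = 'my_job'", "[rows][updated_at]" ]
      }
    }

Note the jdbc output there is the community logstash-output-jdbc plugin, not one of the bundled outputs, so the exact option names may differ.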
I thought that might be the case. I'm not sure why, but jdbc_streaming blows up my heap. As an input, jdbc works fine, but if I use jdbc_streaming the heap dies. I suspect it's because it copies a large jdbc result set into the target field. And then I have to do a split on that field so that each entry in the array of jdbc results becomes a single event for the output. I think split also makes copies.
Seems like a small problem in concept (fetch/set a variable value from somewhere before running the pipeline), but that turns out to be very difficult.
Thankfully, RobBavey fixed that bug in PR 40. There was a typo in the filter such that if you tried to split an event with a very large array, each of the split events would be created with a copy of the complete array that was then immediately overwritten with a single entry from it. The GC rates for large arrays were ridiculous, resulting in terrible performance.
Oh! Is there a version out that has that fix? I'm not sure how to tell if I have it; would that be in a new version of Logstash, or some specific pull of the split plugin I'd need to do?
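For what it's worth, I think I can at least check which version of the split plugin is installed, and update just that plugin, with something like this (assuming I'm reading the plugin manager docs right):

    bin/logstash-plugin list --verbose logstash-filter-split
    bin/logstash-plugin update logstash-filter-split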
Coming at this from another angle, could I use pipeline-to-pipeline communication to enforce a sequence of two pipelines? Combined with using multiple inputs/outputs, would something like this work?
- pipeline.id: upstream
  config.string: |
    input { http { } }    # fetch the timestamp
    output {
      pipeline { send_to => [my_downstream] }
      file { }    # update the last_run_metadata_path myself with the timestamp
    }
- pipeline.id: downstream
  config.string: |
    input {
      pipeline { address => my_downstream }
      jdbc {
        # uses the sql_last_value I want because the upstream pipeline wrote it
        type => "my_jdbc"
      }
    }
    filter {
      # throw out anything where type != "my_jdbc" so I only see the jdbc events,
      # not the upstream pipeline event
    }
    output { elasticsearch { } }
This seems really hacky, but would it accomplish what I'm trying to do, which is to determine :sql_last_value myself before each pipeline run?
I don't think so. The input cannot reference the fields of an event, and order is not guaranteed.
Are you using paging in the jdbc input? I am wondering if the result set for the query is very large. The input would fetch a subset of the result set and flush it into the pipeline in batches, whereas the filter would fetch the whole thing in a single event.
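If memory serves, paging is just a couple of options on the input, roughly:

    jdbc {
      # plus the usual connection options and statement
      jdbc_paging_enabled => true
      jdbc_page_size => 10000
    }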
Yeah, it's a SQL query with a LIMIT 100000 on it, so it's a big result set. For business reasons, I can't make it any smaller than that 100,000 limit. I know that sounds silly, but trust me... I've spent days on that already.
I'm going to explore using AWS EFS, which would give a persistent file system where the logstash container can write the last_run_metadata_path. That will survive the ECS container being destroyed and redeployed.
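Something like this is what I have in mind, assuming the EFS volume is mounted at /mnt/efs inside the container (the path, connection string, and column names are just examples):

    jdbc {
      jdbc_connection_string => "jdbc:postgresql://data-host/data"   # made up
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_class => "org.postgresql.Driver"
      statement => "SELECT * FROM mytable WHERE updated_at > :sql_last_value LIMIT 100000"
      use_column_value => true
      tracking_column => "updated_at"
      tracking_column_type => "timestamp"
      # the EFS mount survives the ECS container being replaced
      last_run_metadata_path => "/mnt/efs/logstash/.my_jdbc_last_run"
      schedule => "* * * * *"
    }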
BTW, thanks for all your help and responsiveness. Having someone willing to be responsive and offer advice is a huge boost for my morale, even if I can't quite do what I was trying to do.