I'm importing a ton of events into Elasticsearch via Logstash from an S3 bucket containing logs.
Based on a specific UUID in every event I want to fetch the aggregated info from a MySQL database and update/enrich that same event with that extra data, so it can be used later in Kibana.
web 2016-06-23 15:17:55.612 2016-06-23 14:59:53.000 unstruct 9f8aaa2f-b24f-4c8b-8ca6-f6289f072a85 custom clj-1.1.0-tom-0.2.0 hadoop-1.6.0-common-0.21.0 93.XXX.243.XXX 07eae7d2-04ae-464f-8cd0-98a917bb9975 http://xxx.com/test http xx.com 80 /test {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.xxx/open/jsonschema/1-0-1","data":{"cid":"2253450","eid":"2231323","uid":"21"}}} Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36 2016-06-23 14:59:53.000 com.xxx open jsonschema 1-0-1
Within Logstash we're able to parse all the parameters, including the ones inside the embedded JSON schema payload.
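For reference, a minimal sketch of what that parsing stage might look like. The field name unstruct_event and the nested paths are assumptions based on the sample event above; adjust them to whatever your earlier grok/csv stage actually produces:

```
filter {
  # Assumes an earlier stage (e.g. grok or csv) has put the raw JSON
  # from the sample event into a field called "unstruct_event".
  json {
    source => "unstruct_event"
    target => "unstruct"
  }
  # Pull the inner cid/eid/uid values up to top-level fields so they
  # are easy to reference later.
  mutate {
    add_field => {
      "cid" => "%{[unstruct][data][data][cid]}"
      "eid" => "%{[unstruct][data][data][eid]}"
      "uid" => "%{[unstruct][data][data][uid]}"
    }
  }
}
```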
So our idea is, based on the cid, eid, and uid values passed, to look up a MySQL database and retrieve some extra info that allows us to enrich the event data. Let's say gender, age, anything else that would allow us to build more consistent reports in Kibana.
I've found a plugin for Logstash that would do just that, but it is yet to be developed.
Right now there's no built-in functionality to handle JDBC at the filter stage. There is, however, a community-made jdbc filter by one of the most involved Logstash contributors, but I haven't tested it myself.
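If it behaves the way a typical JDBC-style lookup filter does, the configuration would look roughly like the sketch below. The option names follow the jdbc_streaming filter interface and may differ in the community plugin; the connection string, credentials, table, and column names are all placeholders:

```
filter {
  # Hypothetical lookup configuration; treat the option names as a sketch,
  # the actual community plugin may expose a different interface.
  jdbc_streaming {
    jdbc_driver_library    => "/opt/jdbc/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class      => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/users"
    jdbc_user              => "logstash"
    jdbc_password          => "secret"
    # Look up extra attributes for the uid extracted from the event.
    statement  => "SELECT gender, age FROM users WHERE id = :uid"
    parameters => { "uid" => "uid" }
    target     => "user_info"
  }
}
```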
An alternative, which I used at my last job, is to use the zeromq filter together with a small application that speaks ZeroMQ and executes the queries using a Ruby library like Sequel.
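A minimal sketch of what that small application might look like, assuming a REQ/REP setup where Logstash sends the lookup key as JSON and expects the enriched fields back as JSON. The gem choices (ffi-rzmq, mysql2), the socket address, and the users table are all assumptions:

```ruby
require 'ffi-rzmq'
require 'sequel'
require 'json'

# Connect to the MySQL database that holds the extra user attributes.
DB = Sequel.connect('mysql2://logstash:secret@localhost/users')

context = ZMQ::Context.new
socket  = context.socket(ZMQ::REP)
socket.bind('tcp://*:5556')

loop do
  request = ''
  socket.recv_string(request)

  # Expect a JSON payload containing the uid extracted from the event.
  uid = JSON.parse(request)['uid']
  row = DB[:users].where(id: uid).first || {}

  # Reply with the fields we want merged back into the event.
  socket.send_string({ 'gender' => row[:gender], 'age' => row[:age] }.to_json)
end
```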