Previously, we have read from Kafka topics containing plain Avro records using the avro codec, and Avro records registered with Confluent Schema Registry using the avro_schema_registry codec.
Now we have encrypted Kafka messages. Our encryption process encrypts the raw bytes of the whole message, not the content inside the Avro records, so to get to the data you first need to decrypt the bytes before deserializing. In other words, the flow is data -> Avro bytes -> encrypted bytes, meaning the Kafka message is not a valid Avro record until it has been decrypted.
I see that Logstash has a Cipher filter; I'm wondering whether I can use it somehow, or whether I need to write a new codec. I'm new to Logstash, so I'm not certain.
The Kafka plugin is an input and the cipher plugin is a filter (filters run on the output of inputs), while the Avro codec runs as part of the input plugin. That makes me think we can't use the cipher filter to decrypt the bytes before they reach the Avro codec, but I may be wrong. Is there a way?
Currently, the avro codec:

1. receives the raw bytes of the message from the input
2. uses the [avro gem](https://rubygems.org/gems/avro) to convert those bytes to a ruby Hash
3. creates the Logstash Event
From what I can tell, we're looking to inject a new step between #1 and #2 that decrypts cipherbytes into plainbytes. That way we can do all of the decoding in the codec and emit events that are fully-contextualised from the get-go.
This will require modifications to the avro codec, but we could easily support a pipeline config that looked something like the following (the cipher-related options shown are hypothetical; nothing in the current codec supports them yet):
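```
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["encrypted-topic"]
    codec => avro {
      schema_uri => "/path/to/schema.avsc"
      # hypothetical decryption settings, not supported by today's codec:
      cipher_algorithm => "aes-256-cbc"
      cipher_key       => "${CIPHER_KEY}"
      cipher_iv_length => 16
    }
  }
}
```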
Alternatively, if we were to use the plain codec and emit events from logstash-input-kafka whose message merely contains our cipherbytes, we would need two filters: one to convert the cipherbytes to plainbytes (logstash-filter-cipher), and another to run those plainbytes through Avro and populate our Event's attributes. There is presently no logstash-filter-avro-decode, but it would be trivial to make one that wraps the avro gem.
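For illustration, that filter-chain approach might look something like this; the avro_decode filter below doesn't exist yet, and the cipher settings assume raw (non-base64) bytes with a 16-byte IV prefix:

```
input {
  kafka {
    topics => ["encrypted-topic"]
    codec => plain   # leave the cipherbytes untouched in [message]
  }
}
filter {
  cipher {
    mode             => "decrypt"
    algorithm        => "aes-256-cbc"
    key              => "${CIPHER_KEY}"
    iv_random_length => 16      # read the IV from the first 16 bytes
    base64           => false   # the message is raw bytes, not base64
    source           => "message"
    target           => "plainbytes"
  }
  # hypothetical filter wrapping the avro gem; does not exist today
  avro_decode {
    source     => "plainbytes"
    schema_uri => "/path/to/schema.avsc"
  }
}
```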
That said, it looks like logstash-filter-cipher has some outstanding issues and may need some work before it can be a viable option (e.g., the key attribute is not safeguarded and can show up in debug output, which is less than ideal from a security standpoint).
Additionally, it assumes that the IV used to encrypt the plainbytes is prepended as the first externally-agreed-upon number of bytes of the input, which may or may not be the case for your cipherbytes.
Thanks for the reply @yaauie. One small note: it seems the hyperlinks on "avro gem" aren't working; could you please edit?
Other than that, it seems we will have to write new code with either approach, which is acceptable. The codec approach you describe is what I was originally planning to do: perhaps a decrypting codec that decorates another, so to speak (I've sketched what I mean below).
Is it fine if debug output is disabled? If so, this approach may be acceptable; if not, then the first approach may be the only option.
Luckily, this is fine for our use case: the first 16 bytes of each message are the IV.
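To illustrate the decorating idea, here's a very rough Ruby sketch; the codec name, its options, and the AES-256-CBC choice are all made up for the example:

```ruby
# Hypothetical "decorating" codec: decrypts payloads whose first 16 bytes
# are the IV, then hands the plainbytes to the real avro codec.
require "logstash/codecs/base"
require "logstash/codecs/avro"
require "openssl"

class LogStash::Codecs::DecryptingAvro < LogStash::Codecs::Base
  config_name "decrypting_avro"

  config :key,        :validate => :string, :required => true # 32 bytes for aes-256-cbc
  config :schema_uri, :validate => :string, :required => true

  IV_LENGTH = 16

  def register
    # the codec we decorate; it does all of the actual Avro decoding
    @inner = LogStash::Codecs::Avro.new("schema_uri" => @schema_uri)
    @inner.register
  end

  def decode(data)
    iv          = data[0, IV_LENGTH]
    cipherbytes = data[IV_LENGTH..-1]

    cipher = OpenSSL::Cipher.new("aes-256-cbc")
    cipher.decrypt
    cipher.key = @key
    cipher.iv  = iv
    plainbytes = cipher.update(cipherbytes) + cipher.final

    # delegate to the wrapped codec, yielding its events upward
    @inner.decode(plainbytes) { |event| yield event }
  end
end
```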
If you can, you may want to fork the logstash-codec-avro plugin and add your decryption logic there; if done in a generic way with good test coverage, that would likely be a welcome contribution back to the plugin we maintain. That way, you won't be stuck being the only one maintaining it, and you'll benefit should we improve the codec in the future.