logstash-output-webhdfs: combining Snappy stream files

The logstash-output-webhdfs plugin can write framed Snappy files to HDFS (with the option snappy_format = stream). According to the Snappy framing format documentation (https://github.com/google/snappy/blob/master/framing_format.txt), this has an advantage: "This allows for easy concatenation of compressed files without the need for re-framing."

I would like to combine the multiple output files of logstash-output-webhdfs. With hourly files from 6 nodes with 8 workers each, we would otherwise waste a lot of resources in HDFS, as file sizes would be much smaller than the block size.

So is there a "good" way to combine/concat Snappy-compressed stream files (produced by the logstash-output-webhdfs plugin) into a single file? I tried hadoop-streaming.jar with "cat" as mapper and reducer, but this decompresses all files, and the result would have to be recompressed, which should not be necessary thanks to Snappy framing.

Ideas, suggestions?
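
To make the idea concrete, here is a minimal sketch of what "concatenation without re-framing" could look like on local copies of the files. It assumes the files really follow the framing format from framing_format.txt, which is why it checks each input for that format's stream identifier first; whether the plugin's stream output passes this check is something to verify, not something the sketch guarantees. The function name concat_framed_snappy and the file paths are hypothetical.

```python
import shutil

# Stream identifier chunk as defined in framing_format.txt:
# chunk type 0xff, 3-byte little-endian length 6, then the ASCII bytes "sNaPpY".
STREAM_IDENTIFIER = b"\xff\x06\x00\x00sNaPpY"


def starts_with_stream_identifier(path):
    """Return True if the file begins with the framing-format stream identifier."""
    with open(path, "rb") as src:
        return src.read(len(STREAM_IDENTIFIER)) == STREAM_IDENTIFIER


def concat_framed_snappy(inputs, output):
    """Append the input files byte for byte into output, without decompressing.

    Hypothetical helper: it only relies on the framing-format property that
    concatenated streams remain a valid stream. Returns the bytes written.
    """
    # Validate everything up front so no partial output is left on bad input.
    for path in inputs:
        if not starts_with_stream_identifier(path):
            raise ValueError(f"{path} does not start with a framing-format stream identifier")

    written = 0
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # raw copy, no (de)compression
        written = out.tell()
    return written
```

Something like concat_framed_snappy(["node1-w1.snappy", "node1-w2.snappy"], "hour.snappy") would then produce the merged file. The same byte-level append should be possible directly on HDFS, but this sketch does not cover that.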

@ctr I'm going to move this post to the logstash forum as that seems like a better place to find some answers.
