Multiline codec behavior on large input files

Hi there,
hope you're doing well.
I have a question about how the multiline codec processes the input file. According to the file input docs, the file is read in chunks and lines are extracted from those chunks. Say we have a large input file (about 5 GB) in read mode, and the multiline codec has a pattern that merges lines into one event. If the codec reaches the end of a chunk without hitting the max_bytes or max_lines thresholds, and the pattern has still not been matched, what does the multiline codec do? Does it continue into the next chunk looking for the pattern, or does it close the merged event and start over with the next chunk?
In other words, do I need to change the default file_chunk_size in the file input if the input file is very large and the multiline codec may need to merge 1000 or more lines into a single event?

My input is below. Given the above, I'm not sure whether I really need a large value for the file_chunk_size option.

input {
    file {
        mode => "read"
        path => [...]
        file_chunk_size => 1024000
        codec => multiline {
            pattern => 'WARC-Type: request'
            negate => true
            what => previous
            auto_flush_interval => 10
            max_lines => 100000
            charset => "UTF-8"
        }
        file_completed_action => "log_and_delete"
        file_completed_log_path => ...
    }
}

Chunk processing happens well upstream of the codec. The file input uses the filewatch library to read the file. filewatch reads a chunk and splits it into lines. If a chunk doesn't end with a line delimiter, it carries the partial line over and completes it from the next chunk. Once it has a complete line, it passes it to the file input, which then passes the line to the multiline codec.

I cannot see any way for the chunk size to affect the multiline codec.
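
To make that concrete, here is a rough Python sketch of the flow described above. It is not the filewatch or multiline codec source. The names `lines_from_chunks` and `multiline_events` are made up for illustration, and the sketch ignores max_lines, max_bytes and auto_flush_interval. It only shows that a partial line is buffered across chunk boundaries, so the grouping step sees the same lines whatever the chunk size is.

```python
import io
import re


def lines_from_chunks(stream, chunk_size, delimiter="\n"):
    """Yield complete lines, carrying a partial line over to the next chunk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the rest is partial.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    if buffer:
        # Simplification in this sketch: emit a trailing line with no delimiter.
        yield buffer


def multiline_events(lines, pattern):
    """Group lines the way negate => true, what => previous would:
    a line matching the pattern starts a new event, and every
    non-matching line is appended to the current event."""
    regex = re.compile(pattern)
    current = []
    for line in lines:
        if regex.search(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)


def events_for(text, chunk_size, pattern="WARC-Type: request"):
    """Run both stages over an in-memory string with a given chunk size."""
    return list(multiline_events(lines_from_chunks(io.StringIO(text), chunk_size), pattern))
```

Calling `events_for` on the same text with a chunk size of 1 and with a chunk size of 1000 gives the same list of events, because the grouping step only ever receives whole lines.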


Thanks Badger. Now I see the workflow, and that the codec is independent of the file read process :+1: