Dear Team,
Requesting your guidance to solve this problem.
Big Problem: There's a set of old, HUGE XML files (100+ MB per file, not pretty-formatted, over 100 million characters each) containing logs that I have been trying to ingest and view via a Filebeat-Logstash-Elasticsearch-Kibana pipeline. This is archived system bug report data from an application, and we're trying to use the power of the Elastic Stack to troubleshoot more efficiently and effectively.
Most files have the following structure: a <BigMessage> XML document consists of numerous <SmallMessage> elements, and each <SmallMessage> element has various child elements, of which <DebugInfo> contains encoded XML, further complicating matters. I'm trying to get each <SmallMessage> as a separate event in Elasticsearch.
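To make the nesting clearer, here is a pretty-printed skeleton of the structure (illustrative only: the real files have essentially no formatting, element content is elided, and "encoded XML" means the tags inside <DebugInfo> arrive as &lt; / &gt; entities):

<BigMessage>
  <SmallMessage>
    <MessageId>...</MessageId>
    <MessageText>...</MessageText>
    <LogLevel>...</LogLevel>
    <LogTime>...</LogTime>
    <DebugInfo>&lt;OtherInfo&gt;...&lt;/OtherInfo&gt;</DebugInfo> <!-- encoded XML -->
  </SmallMessage>
  <SmallMessage>
    ...
  </SmallMessage>
  <!-- 10,000+ <SmallMessage> elements per file in the real data -->
</BigMessage>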
Method 1: Read the whole <BigMessage> block via Filebeat's multiline parser, send it to Logstash, and use mutate filters to replace &lt; with < and &gt; with > so that the subsequent xml filter can parse the resulting XML correctly.
Result: Filebeat truncates the data even for an 18 MB file, which breaks the input to Logstash and throws xml_parse_failure and split_xml_failure errors.
Method 2: Read the file directly with Logstash, using the multiline codec and mutate filters to replace &lt; with < and &gt; with > so that the subsequent xml filter can parse the resulting XML correctly.
Result: Works for small sample data with about 63 <SmallMessage> elements, but breaks for files with about 90+ elements, so I didn't bother testing against the real data, which has over 10,000 <SmallMessage> elements.
Example format below (yes, there are no newlines in many places; also, some names have been changed for confidentiality reasons):
<BigMessage><SmallMessage><MessageId>12345</MessageId><MessageText>Dolor Ipsum Amet</MessageText><LogLevel>1</LogLevel><LogClass>0</LogClass><LogTime>637807498946031499</LogTime><ModuleId>9876</ModuleId><TargetId>0</TargetId><DebugInfo><OtherInfo><ProcessID>1234</ProcessID><ProcessName>Contoso</ProcessName><ThreadID>1</ThreadID><FileName></FileName><Method>LogException()</Method><Linenumber></Linenumber><StackTrace>
Server stack trace: 
at System.ServiceModel.Func1(Uri uri, ISettings Settings)
at System.ServiceModel.Func2(EndpointAddress address, Uri via)
at System.ServiceModel.Func3(EndpointAddress address, Uri via, TimeSpan timeout, TKey& key)
at BaseClient1.CreateChannel()</StackTrace></OtherInfo></DebugInfo></SmallMessage><SmallMessage><MessageId>12346</MessageId><MessageText>Dolor Ipsum Amet</MessageText><LogLevel>1</LogLevel><LogClass>0</LogClass><LogTime>637807498946031500</LogTime><ModuleId>9877</ModuleId><TargetId>0</TargetId><DebugInfo><OtherInfo><ProcessID>1234</ProcessID><ProcessName>Contoso</ProcessName><ThreadID>1</ThreadID><FileName>setup.txt</FileName><Method>LogException()</Method><Linenumber></Linenumber><StackTrace>
Server stack trace: 
at System.ServiceModel.Func1(Uri uri, ISettings Settings)
at System.ServiceModel.Func2(EndpointAddress address, Uri via)
at System.ServiceModel.Func3(EndpointAddress address, Uri via, TimeSpan timeout, TKey& key)
at BaseClient1.CreateChannel()</StackTrace></OtherInfo></DebugInfo></SmallMessage><SmallMessage><MessageId>12346</MessageId><MessageText>Dolor Ipsum Amet</MessageText><LogLevel>1</LogLevel><LogClass>0</LogClass><LogTime>637807498946031510</LogTime><ModuleId>9878</ModuleId><TargetId>0</TargetId><DebugInfo><OtherInfo><ProcessID>1234</ProcessID><ProcessName>Contoso</ProcessName><ThreadID>1</ThreadID><FileName></FileName><Method>LogException()</Method><Linenumber></Linenumber><StackTrace>
Server stack trace: 
at System.ServiceModel.Func1(Uri uri, ISettings Settings)
at System.ServiceModel.Func2(EndpointAddress address, Uri via)
at System.ServiceModel.Func3(EndpointAddress address, Uri via, TimeSpan timeout, TKey& key)
at BaseClient1.CreateChannel()</StackTrace></OtherInfo></DebugInfo></SmallMessage></BigMessage>
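Just to be explicit about the desired outcome: I'd like one Elasticsearch document per <SmallMessage>, roughly along these lines (purely illustrative; values are taken from the first <SmallMessage> in the sample above, and the exact field layout doesn't matter much):

{
  "MessageId": "12345",
  "MessageText": "Dolor Ipsum Amet",
  "LogLevel": "1",
  "LogClass": "0",
  "LogTime": "637807498946031499",
  "ModuleId": "9876",
  "TargetId": "0",
  "DebugInfo": {
    "OtherInfo": {
      "ProcessID": "1234",
      "ProcessName": "Contoso",
      "ThreadID": "1",
      "Method": "LogException()",
      "StackTrace": "Server stack trace: ..."
    }
  }
}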
Also, please find the relevant bits of the filebeat.yml from Method 1 below, as I still think Filebeat might be smart enough to handle this early in the logging pipeline.
filebeat.inputs:
- type: filestream
  enabled: true
  # parsers:
  #   - multiline:
  #       type: pattern
  #       pattern: '<BigMessage>' # Even tried with <SmallMessage>
  #       flush_pattern: '</BigMessage>' # Even tried with </SmallMessage>
  #       negate: true
  #       match: after
  #       timeout: 600s
  #       max_lines: 1000000
  #       max_bytes: 10000000000
  # processors:
  #   - truncate_fields:
  #       fields:
  #         - message
  #       max_bytes: 10000000000
  #       fail_on_error: true
  #       ignore_missing: false

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - D:/testLog.xml
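(For Method 1, Filebeat ships to Logstash via the standard Logstash output; the host and port below are my assumption, matching the beats input on port 5044 in the Logstash config.)

output.logstash:
  hosts: ["localhost:5044"]  # assumed host; port matches the beats input in logstash.conf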
The relevant bits from logstash.conf are below (the commented sections are for Method 1):
# input
# {
#   beats
#   {
#     port => 5044
#   }
# }
input
{
  file
  {
    path => "D:/testLog.xml"
    #path => "D:\bigLog.xml"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "D:/completedLog.txt"
    start_position => "beginning"
    sincedb_path => "NULL"
    file_chunk_size => 327680
    codec => multiline
    {
      pattern => "<BigMessage>"
      negate => true
      what => "previous"
      #auto_flush_interval => 1
      max_bytes => 102400000
      max_lines => 100000000
    }
  }
}
filter
{
  # Some of our messages within <DebugInfo> have XML tags encoded as &lt; and &gt;, which we decode via mutate before parsing
  mutate
  {
    gsub => [
      "message", "&lt;", "<",
      "message", "&gt;", ">"
    ]
  }
  # Main XML parsing of the content of the message field into an XML document
  xml
  {
    source => "message"
    target => "BigMessage"
    store_xml => true
    force_array => true
  }
  # Splitting the contents of the parsed XML into individual records in Elastic - they are stored within the <SmallMessage> elements
  split
  {
    field => "[BigMessage][SmallMessage]"
  }
  # Since some XMLs are very large, Kibana wouldn't be able to display the content, hence we remove the original big XML
  mutate
  {
    remove_field => ["message", "event.original"]
  }
}
output
{
  # For debug only
  # stdout
  # {
  #   codec => rubydebug
  # }
  elasticsearch
  {
    hosts => ["http://localhost:9200"]
    #index => "%{[@metadata][beat]}-%{[@metadata][version]}"
    index => "test"
  }
}