Hi, I'm working on a project to index millions of newsgroup messages dating from the early 80s to the present day. The messages are supplied as individual files. I'm using Logstash in "read" mode with the multiline codec so that each message forms a unique document in ES. So far so good.
Here's part of a message that's causing me problems:
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!haven!adm!news
From: email@example.com (SMTP MAILER)
Newsgroups: comp.unix.wizards
Subject: Mail not delivered yet, still trying
Message-ID: <22122@adm.BRL.MIL>
Date: 18 Jan 90 13:57:44 GMT
Sender: news@adm.BRL.MIL
Lines: 1261
----Mail status follows----
Have been unable to send your mail to <firstname.lastname@example.org>, will keep trying for a total of three days. At that time your mail will be returned.
----Transcript of message follows----
Date: 18 Jan 90 01:54:00 MST
From: unix-wizards@BRL.MIL
Subject: UNIX-WIZARDS Digest V9#050
To: "declerck" <email@example.com>
Return-Path: <firstname.lastname@example.org>
Received: from SEM.BRL.MIL by ddnvx2.afwl.af.mil with SMTP ; Thu, 18 Jan 90 01:52:47 MST
Received: from SEM.BRL.MIL by SEM.BRL.MIL id aa08556; 18 Jan 90 3:02 EST
Received: from sem.brl.mil by SEM.BRL.MIL id aa08510; 18 Jan 90 2:45 EST
Date: Thu, 18 Jan 90 02:45:15 EST
From: The Moderator (Mike Muuss) <Unix-Wizards-Request@BRL.MIL>
To: UNIX-WIZARDS@BRL.MIL
Reply-To: UNIX-WIZARDS@BRL.MIL
Subject: UNIX-WIZARDS Digest V9#050
Message-ID: <9001180245.aa08510@SEM.BRL.MIL>
UNIX-WIZARDS Digest Thu, 18 Jan 1990 V9#050
Today's Topics:
etc etc
The problem is that I'm trying to parse the message headers into individual fields using grok. But as I hope you can see, message headers can also appear as quoted text within the message body. Those are just quoted text and shouldn't be processed by Logstash.
The actual "real" headers that were processed by the news servers had the format <header>:<value>.
They could appear in any order, not all headers were mandatory, and the mandatory ones changed over the years. The only consistency I've found is that the first line of the message body is the first line that doesn't have the <header>:<value> format. It could be whitespace or text.
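To make the rule concrete, here is a minimal sketch in plain Python (not Logstash config) of the split I'm after. The function name `split_headers` and the exact regex are my own assumptions for illustration. In particular, I'm assuming a header name contains no spaces or colons, and I'm not handling headers folded onto continuation lines.

```python
import re

# Assumed header shape: a name without spaces or colons, a colon, then the value.
HEADER_RE = re.compile(r"^([^\s:]+):\s?(.*)$")


def split_headers(message):
    """Return (headers, body), where the body starts at the first non-header line."""
    headers = {}
    lines = message.splitlines()
    for index, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match is None:
            # First line that is not <header>:<value>: everything from here on is body,
            # including any quoted headers further down.
            return headers, "\n".join(lines[index:])
        # Keep the first occurrence if a header name repeats.
        headers.setdefault(match.group(1), match.group(2))
    # Every line looked like a header, so there is no body.
    return headers, ""


sample = (
    "Path: utzoo!utgpu!news\n"
    "Lines: 1261\n"
    "----Mail status follows----\n"
    "Date: 18 Jan 90 01:54:00 MST\n"
)
headers, body = split_headers(sample)
print(headers)
print(body)
```

With this split, the quoted "Date:" line stays in the body and never becomes a field. What I'm looking for is the equivalent inside a Logstash pipeline, for example in a ruby filter that runs before grok, if that is a sensible way to do it.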
So how do I detect the first line of the message body within Logstash and tell it to stop grokking from that point onwards?