Hi, I'm working on a project to index millions of newsgroup messages dating from the early 80s to the present day. The messages are supplied as individual files. I'm using logstash in "read" mode with the multiline codec so that each message becomes a single document in ES. So far so good.
Here's part of a message that's causing me problems:
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!haven!adm!news
From: postmaster@ddnvx2.afwl.af.mil (SMTP MAILER)
Newsgroups: comp.unix.wizards
Subject: Mail not delivered yet, still trying
Message-ID: <22122@adm.BRL.MIL>
Date: 18 Jan 90 13:57:44 GMT
Sender: news@adm.BRL.MIL
Lines: 1261
----Mail status follows----
Have been unable to send your mail to <declerck@sun4b.afwl.af.mil>,
will keep trying for a total of three days.
At that time your mail will be returned.
----Transcript of message follows----
Date: 18 Jan 90 01:54:00 MST
From: unix-wizards@BRL.MIL
Subject: UNIX-WIZARDS Digest V9#050
To: "declerck" <declerck@sun4b.afwl.af.mil>
Return-Path: <unix-wizards-request@sem.brl.mil>
Received: from SEM.BRL.MIL by ddnvx2.afwl.af.mil with SMTP ;
Thu, 18 Jan 90 01:52:47 MST
Received: from SEM.BRL.MIL by SEM.BRL.MIL id aa08556; 18 Jan 90 3:02 EST
Received: from sem.brl.mil by SEM.BRL.MIL id aa08510; 18 Jan 90 2:45 EST
Date: Thu, 18 Jan 90 02:45:15 EST
From: The Moderator (Mike Muuss) <Unix-Wizards-Request@BRL.MIL>
To: UNIX-WIZARDS@BRL.MIL
Reply-To: UNIX-WIZARDS@BRL.MIL
Subject: UNIX-WIZARDS Digest V9#050
Message-ID: <9001180245.aa08510@SEM.BRL.MIL>
UNIX-WIZARDS Digest Thu, 18 Jan 1990 V9#050
Today's Topics: etc etc
The problem is that I'm trying to parse the message headers into individual fields using grok. As I hope you can see, lines that look like message headers can also appear within the message body. These are just quoted text and shouldn't be processed by logstash.
The actual "real" headers that were processed by the news servers had the format
<header>:<value>
They could appear in any order, not all headers were mandatory, and the mandatory ones changed over the years. The only consistency I've found is that the first line of the message body is the first line that doesn't have the <header>:<value> format. It could be whitespace or text.
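To make the rule concrete, here is a minimal Python sketch of the split I'm after. It is not logstash configuration, only an illustration of the logic. The function name split_headers and the header regex (a name with no spaces, a colon, then the value) are my own assumptions, and the sketch deliberately treats any line starting with whitespace as the start of the body, per the rule above.

```python
import re

# Assumed header shape: a name without spaces, a colon, then the value.
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def split_headers(message):
    """Split a raw message into (headers, body).

    headers is a list of (name, value) pairs, kept as a list because
    the same header name may occur more than once.
    """
    lines = message.splitlines()
    headers = []
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match is None:
            # First line that is not <header>:<value>: the body starts here,
            # and nothing after this point is parsed as a header.
            return headers, "\n".join(lines[i:])
        headers.append((match.group(1), match.group(2)))
    # Every line looked like a header, so there is no body.
    return headers, ""
```

With the sample above, everything from "----Mail status follows----" onwards would end up in the body, so the quoted Date:, From: and Subject: lines would never be treated as headers.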
So how do I detect the first line of the message body within logstash and tell it to stop grokking from that point onwards?