Processing old newsgroup messages

Hi, I'm working on a project to index millions of newsgroup messages dating from the early 80s to the present day. The messages are supplied as individual files. I'm using logstash in "read" mode with the multiline codec so that each message becomes a single document in ES. So far so good.
Here's part of a message that's causing me problems:

 Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!haven!adm!news
 From: postmaster@ddnvx2.afwl.af.mil (SMTP MAILER)
 Newsgroups: comp.unix.wizards
 Subject: Mail not delivered yet, still trying
 Message-ID: <22122@adm.BRL.MIL>
 Date: 18 Jan 90 13:57:44 GMT
 Sender: news@adm.BRL.MIL
 Lines: 1261
 
 
  ----Mail status follows----
 Have been unable to send your mail to <declerck@sun4b.afwl.af.mil>,
 will keep trying for a total of three days.
 At that time your mail will be returned.
 
  ----Transcript of message follows----
 Date: 18 Jan 90 01:54:00 MST
 From: unix-wizards@BRL.MIL
 Subject: UNIX-WIZARDS Digest  V9#050
 To: "declerck" <declerck@sun4b.afwl.af.mil>
 
 Return-Path: <unix-wizards-request@sem.brl.mil>
 Received: from SEM.BRL.MIL by ddnvx2.afwl.af.mil with SMTP ; 
           Thu, 18 Jan 90 01:52:47 MST
 Received: from SEM.BRL.MIL by SEM.BRL.MIL id aa08556; 18 Jan 90 3:02 EST
 Received: from sem.brl.mil by SEM.BRL.MIL id aa08510; 18 Jan 90 2:45 EST
 Date:       Thu, 18 Jan 90 02:45:15 EST
 From:       The Moderator (Mike Muuss) <Unix-Wizards-Request@BRL.MIL>
 To:         UNIX-WIZARDS@BRL.MIL
 Reply-To:   UNIX-WIZARDS@BRL.MIL
 Subject:    UNIX-WIZARDS Digest  V9#050
 Message-ID:  <9001180245.aa08510@SEM.BRL.MIL>
 
 UNIX-WIZARDS Digest          Thu, 18 Jan 1990              V9#050
 
 Today's Topics: etc etc

The problem is that I'm trying to parse the message headers into individual fields using grok. But as I hope you can see, message headers can also appear as quoted text within the message body. Those are just quoted text and shouldn't be processed by logstash.

The actual "real" headers that were processed by the news servers had the format
<header>:<value>
They could appear in any order, not all headers were mandatory, and the set of mandatory headers changed over the years. The only consistency I've found is that the message body starts at the first line that doesn't have the <header>:<value> format. That line could be whitespace or text.
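
To make that rule concrete, here's a rough Python sketch of the behaviour I'm after, written outside of logstash. It's only an illustration: the `split_message` function and the `HEADER_RE` pattern are names I made up, and the pattern assumes a header name is a run of printable characters with no spaces or colons. It doesn't try to handle folded (continuation) header lines.

```python
import re

# Assumed header shape: a name with no spaces or colons, a colon, then the value.
HEADER_RE = re.compile(r"^([!-9;-~]+):\s*(.*)$")


def split_message(text):
    """Split a raw message into (headers, body).

    The body starts at the first line that does not look like
    <header>:<value>; everything from that line on is left unparsed.
    """
    headers = {}
    lines = text.splitlines()
    body_start = len(lines)  # default: message with headers only
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match is None:
            # First non-header line: stop parsing headers here.
            body_start = i
            break
        headers[match.group(1)] = match.group(2)
    body = "\n".join(lines[body_start:])
    return headers, body
```

With this, a quoted `Date:` or `From:` line further down the message stays inside the body text and never overwrites the real header fields. What I can't work out is how to get the same "stop at the first non-header line" behaviour inside a logstash pipeline.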

So how do I detect the first line of the message body within logstash and tell it to stop grokking from that point onwards?
