Defining a custom analyzer that understands URLs/hostnames/emails and can exclude patterns


For example, if I have a message line like:

Mar 14 20:22:41 subdomain.mydomain.colo postfix/smtpd[16862]: NOQUEUE:
reject: RCPT from unknown[]: 450 4.7.1 Client host rejected:
cannot find your reverse hostname, []; proto=ESMTP helo=<> also

The standard tokenizer splits email addresses, and also hostnames if they
contain "-". I would like email addresses and hostnames to be kept as single
tokens. One way I can see to do that is a char_filter that replaces "-" with
"_" so that the hostname is not split. That works, but isn't there a better
way that doesn't replace hyphens? I also saw the uax_url_email tokenizer,
which might help.
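
To make the two options above concrete, here is a minimal sketch of index settings, written as a Python dict so it can be dumped to JSON. The analyzer and char filter names (hyphen_safe, url_email, hyphen_to_underscore) are made up for illustration; only the "mapping" char filter type, the "standard" and "uax_url_email" tokenizers, and the "lowercase" filter are Elasticsearch built-ins. Whether uax_url_email keeps a bare hyphenated hostname as one token is an assumption that would need checking with the _analyze API.

```python
import json

# Sketch only: the names "hyphen_to_underscore", "hyphen_safe" and
# "url_email" are hypothetical; pick whatever fits your index.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Option 1: rewrite "-" to "_" before tokenizing, so the
                # standard tokenizer does not break on the hyphen.
                "hyphen_to_underscore": {
                    "type": "mapping",
                    "mappings": ["- => _"],
                }
            },
            "analyzer": {
                "hyphen_safe": {
                    "type": "custom",
                    "char_filter": ["hyphen_to_underscore"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
                # Option 2: leave the text alone and use the tokenizer
                # that recognises URLs and email addresses.
                "url_email": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],
                },
            },
        }
    }
}

# Request body that could be sent when creating the index.
settings_json = json.dumps(index_settings, indent=2)
print(settings_json)
```

Note that with option 1 the indexed tokens contain "_" instead of "-", so the same analyzer has to be applied at search time for hyphenated hostnames to match.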

Also, I would like to exclude a few words (for example, "from=<>") from
tokenizing. I can see that being possible by using word_delimiter.
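
A rough sketch of the word_delimiter idea, again as a Python dict that dumps to JSON. It assumes "exclude from tokenizing" means "keep the word as a single token", and it assumes a whitespace tokenizer so that "from=<>" reaches the filter in one piece. The names keep_protected and log_whitespace are hypothetical; "word_delimiter" and its "protected_words" parameter are the built-in filter type and option.

```python
import json

# Sketch only: "keep_protected" and "log_whitespace" are made-up names.
delimiter_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "keep_protected": {
                    "type": "word_delimiter",
                    # Tokens listed here are passed through without
                    # being split on punctuation.
                    "protected_words": ["from=<>"],
                }
            },
            "analyzer": {
                "log_whitespace": {
                    "type": "custom",
                    # Split on whitespace only, so "from=<>" is still
                    # one token when the filter sees it.
                    "tokenizer": "whitespace",
                    "filter": ["keep_protected", "lowercase"],
                }
            },
        }
    }
}

delimiter_json = json.dumps(delimiter_settings, indent=2)
print(delimiter_json)
```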

Can anyone please help me put all this together the right way? One more
thing: is it possible to apply both standard and whitespace analyzers on a

Abhijeet Rastogi (shadyabhi)
