Defining a custom analyzer that understands URLs/hostnames/emails and can exclude patterns

Hi,

For example, suppose I have a log line like:

Mar 14 20:22:41 subdomain.mydomain.colo postfix/smtpd[16862]: NOQUEUE:
reject: RCPT from unknown[1.2.3.4]: 450 4.7.1 Client host rejected:
cannot find your reverse hostname, [5.6.7.8]; from=erp@misms.net.in
to=a@domain1.com proto=ESMTP helo=<a.domain.net> also
from=<>

The standard tokenizer splits email addresses, and it also splits
hostnames if they contain "-". I would like email addresses and hostnames
to be kept as single tokens. One way I can see is to use a char_filter that
replaces "-" with "_" so that the hostname is not split. That works, but
isn't there a better way that doesn't involve replacing hyphens? I also saw
the uax_url_email tokenizer, which might help.
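
To make the two options above concrete, here is a rough sketch of the index settings, written as a Python dict that serializes to the JSON request body. The names "hyphen_to_underscore", "log_hyphen_mapped" and "log_url_email" are made up for illustration; "mapping", "standard", "uax_url_email" and "lowercase" are built-in types. Whether bare hostnames with hyphens come out as single tokens under uax_url_email is something to verify with the _analyze API rather than assume.

```python
import json

# Hypothetical names: hyphen_to_underscore, log_hyphen_mapped, log_url_email.
analysis_settings = {
    "analysis": {
        "char_filter": {
            # Option 1: rewrite "-" to "_" before tokenizing.
            "hyphen_to_underscore": {
                "type": "mapping",
                "mappings": ["- => _"],
            }
        },
        "analyzer": {
            "log_hyphen_mapped": {
                "type": "custom",
                "char_filter": ["hyphen_to_underscore"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            },
            # Option 2: a tokenizer that is aware of URLs and email addresses.
            "log_url_email": {
                "type": "custom",
                "tokenizer": "uax_url_email",
                "filter": ["lowercase"],
            },
        },
    }
}

# Body that could be sent when creating the index.
print(json.dumps({"settings": analysis_settings}, indent=2))
```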

Also, I would like to exclude a few words (for example, "from=<>") from
tokenization. It looks like that might be possible with word_delimiter.
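
A minimal sketch of the word_delimiter idea, assuming "exclude" means "leave the token untouched". The names "log_word_delimiter" and "log_whitespace_wd" are hypothetical; "protected_words" and "preserve_original" are existing options of the word_delimiter token filter. Note that word_delimiter splits every other token on non-alphanumeric characters, so email addresses and hostnames would still be broken up unless the original token is preserved as well.

```python
import json

# Hypothetical names: log_word_delimiter, log_whitespace_wd.
wd_settings = {
    "analysis": {
        "filter": {
            "log_word_delimiter": {
                "type": "word_delimiter",
                # Tokens listed here are passed through unsplit.
                "protected_words": ["from=<>"],
                # Also emit the unsplit token next to its parts.
                "preserve_original": True,
            }
        },
        "analyzer": {
            "log_whitespace_wd": {
                "type": "custom",
                # Whitespace tokenizer keeps "from=<>" as one token,
                # so the protected_words entry can match it.
                "tokenizer": "whitespace",
                "filter": ["log_word_delimiter", "lowercase"],
            }
        },
    }
}

print(json.dumps({"settings": wd_settings}, indent=2))
```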

Can anyone please help me put all of this together the right way? One more
thing: is it possible to apply both the standard and whitespace analyzers
to the same field?
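
On the last point, one approach that may fit is a multi-field mapping, where the same source value is indexed twice with different analyzers. The sketch below assumes a field called "message" and a sub-field called "ws" (both hypothetical), and uses the "text" type found in recent Elasticsearch versions; older versions use "string" instead.

```python
import json

# Hypothetical field names: "message" (main field) and "ws" (sub-field).
multi_field_mapping = {
    "properties": {
        "message": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                # Queried as "message.ws"; same text, whitespace analyzer.
                "ws": {"type": "text", "analyzer": "whitespace"}
            },
        }
    }
}

print(json.dumps({"mappings": multi_field_mapping}, indent=2))
```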

--
Regards,
Abhijeet Rastogi (shadyabhi)
