Help isolating email addresses in a single field

Hello, I'm new to Logstash and I need some help.
I'm processing a big index of about 600 million records that has a field called "message".
The structure of this field varies, because the data came from different sources.

I need to re-process the whole index, capturing the email addresses and filtering out everything else.
I also need to discard duplicate email addresses.

Here are some samples of the varying "message" data in the source index (ignore the outer quotation marks):

"ok_for_all;824174284;Hanley;Maureen;Hanover;03755-1321;18 Woodmore Dr;;NH;maureenmh@valley.net;maureenmh;valley.net;;;;;;;;;;;279007"

"ok;824174799;Corbe;Herve;Youngstown;44504-1406;560 Tod Ln;;OH;hcorbe@neo.rr.com;hcorbe;neo.rr.com;;;;;;;;;;;279090"

""jip.geer@wxs.nl","p_unknown_email""

""tinsie@hetnet.nl","ok""

"ok_for_all;"6903420";"Joseph";"Hermo";"5 Regent St";"Ste 513N";"Livingston";"NJ";"7039";"973-535-5000";"jhermo@gmsgroup.com";"unknown";"jhermo";"gmsgroup.com";"";"";"";"";"";"";"";"";"";"";"";"91255""

"ok;harriet;wallach;142 monterey pointe dr;;west palm beach;fl;33418;bhavey2001@yahoo.com;bhavey2001;yahoo.com;;"

"email_disabled;"2759190";"Jack";"Malarik";"713 Creekview Dr";"";"Eastlake";"OH";"44095";"";"jackm@c-p-a.com";"unknown";"jackm";"c-p-a.com";"";"";"";"";"";"";"";"";"";"";"";"84973""

The pipeline config is:

input {
  elasticsearch {
    hosts => "localhost"
    index => "filteredemails"
    query => '{ "query": { "query_string": { "query": "*" } } }'
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
filter {
  grok {
    patterns_dir => ["/usr/share/logstash/patterns"]
    keep_empty_captures => true
    match => { "message" => "%{EMAILADDRESS:clean-email}" }
  }
  grok {
    match => { "clean-email" => ";%{EMAILADDRESS:[email_ok]};" }
  }
}
output {
  elasticsearch {
    index => "isolated.%{[@metadata][_index]}"
    document_type => "uax_url_email"
    document_id => "%{[@metadata][_id]}"
  }
}

However, extraction into the "clean-email" field fails in some cases.
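To make the goal concrete, here is a minimal Python sketch (outside Logstash) of the behaviour I'm after: pull every email address out of a "message" value and drop duplicates. The regular expression is a simplified assumption, not the definition of the grok EMAILADDRESS pattern, and the function names are made up for illustration. The sample lines are taken from the examples above.

```python
import re

# Simplified email pattern (an assumption; it is NOT the grok EMAILADDRESS pattern).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(message):
    """Return the distinct addresses found in one message, lowercased, in order."""
    found = []
    for match in EMAIL_RE.findall(message):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


def unique_emails(messages):
    """Yield each address once across all messages (keeps a set in memory)."""
    seen = set()
    for message in messages:
        for email in extract_emails(message):
            if email not in seen:
                seen.add(email)
                yield email


samples = [
    "ok;824174799;Corbe;Herve;Youngstown;44504-1406;560 Tod Ln;;OH;"
    "hcorbe@neo.rr.com;hcorbe;neo.rr.com;;;;;;;;;;;279090",
    '"jip.geer@wxs.nl","p_unknown_email"',
    "ok;harriet;wallach;142 monterey pointe dr;;west palm beach;fl;33418;"
    "bhavey2001@yahoo.com;bhavey2001;yahoo.com;;",
]

for email in unique_emails(samples + samples):  # feed duplicates on purpose
    print(email)
```

An in-memory set like this would probably not scale to 600 million records. In Elasticsearch, one option may be to use the extracted address itself as the document ID, since indexing a second document with the same ID overwrites the first.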

I found a way to define an analyzer and tokenizer that separate out email addresses, but I can't figure out how to apply them to the whole dataset. Here is the definition:

PUT isolated.filteredemails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email"
        }
      },
      "filter": [
        "email",
        "lowercase",
        "unique"
      ]
    }
  }
}

This analyzer works when I test it on a single message value, and it tokenizes the email address correctly:

POST isolated.filteredemails/_analyze
{
  "analyzer": "my_analyzer",
  "text": "babylon;ny;11704;bcook211@yahoo.com"
}
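As far as I understand, the uax_url_email tokenizer tags each email token with the type "<EMAIL>", so the addresses could in principle be picked out of the _analyze response by filtering on that type. A minimal Python sketch, assuming the response has already been parsed from JSON. The function name is made up, and the example dictionary is hand-written in the shape of an _analyze response (trimmed to the fields used), not captured output.

```python
def emails_from_analyze(response):
    """Return the distinct tokens of type <EMAIL> from a parsed _analyze response."""
    emails = []
    for tok in response.get("tokens", []):
        if tok.get("type") == "<EMAIL>" and tok["token"] not in emails:
            emails.append(tok["token"])
    return emails


# Hand-written illustration of the response shape; offsets and positions omitted.
example = {
    "tokens": [
        {"token": "babylon", "type": "<ALPHANUM>"},
        {"token": "ny", "type": "<ALPHANUM>"},
        {"token": "11704", "type": "<NUM>"},
        {"token": "bcook211@yahoo.com", "type": "<EMAIL>"},
    ]
}

print(emails_from_analyze(example))
```

Calling _analyze once per document for the whole index is likely too slow, so this only shows what the tokenizer exposes, not a way to process the full dataset.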

I don't know how to use this tokenizer to extract just the email part of "message" and store it in a new field called "clean-email".

Thank you in advance to anyone who can give me a hand with this.
