Help isolating email addresses in a single field

Hello, I'm new to Logstash and I need some help.
I'm processing a big index of about 600 million records that has a field called "message".
The structure of this field's data varies because it was created from different sources.

I need to re-process the whole index, capturing the email addresses and filtering out the rest.
I also need to discard duplicate email addresses.
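
To make the goal concrete, the per-record logic I'm after looks roughly like the Python sketch below. The regex is a deliberately simple assumption of mine (it is not what the EMAILADDRESS grok pattern does), the function names are made up, and the in-memory set is only for illustration, since it would not scale to 600 million records.

```python
import re

# Deliberately simple pattern: an assumption for illustration,
# not a full RFC 5322 matcher and not the EMAILADDRESS grok pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(message):
    """Return every email-like substring in one "message" value, lowercased."""
    return [match.lower() for match in EMAIL_RE.findall(message)]


def unique_emails(messages):
    """Yield each address once, in first-seen order, across many messages."""
    seen = set()  # illustration only; too large to hold for the real index
    for message in messages:
        for email in extract_emails(message):
            if email not in seen:
                seen.add(email)
                yield email
```

In the real pipeline I guess the equivalent of the set could be using the address itself as the document_id, so that duplicates overwrite each other, but I have not tested that.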

Here are some samples of the varying "message" data in the source index (ignore the outer quotation marks):

"ok_for_all;824174284;Hanley;Maureen;Hanover;03755-1321;18 Woodmore Dr;;NH;;maureenmh;;;;;;;;;;;;279007"

"ok;824174799;Corbe;Herve;Youngstown;44504-1406;560 Tod Ln;;OH;;hcorbe;;;;;;;;;;;;279090"



"ok_for_all;"6903420";"Joseph";"Hermo";"5 Regent St";"Ste 513N";"Livingston";"NJ";"7039";"973-535-5000";"";"unknown";"jhermo";"";"";"";"";"";"";"";"";"";"";"";"";"91255""

"ok;harriet;wallach;142 monterey pointe dr;;west palm beach;fl;33418;;bhavey2001;;;"

"email_disabled;"2759190";"Jack";"Malarik";"713 Creekview Dr";"";"Eastlake";"OH";"44095";"";"";"unknown";"jackm";"";"";"";"";"";"";"";"";"";"";"";"";"84973""

The pipeline config is:

input {
  elasticsearch {
    hosts => "localhost"
    index => "filteredemails"
    query => '{ "query": { "query_string": { "query": "*" } } }'
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
filter {
  grok {
    patterns_dir => ["/usr/share/logstash/patterns"]
    keep_empty_captures => true
    match => { "message" => "%{EMAILADDRESS:clean-email}" }
  }
  grok {
    match => { "clean-email" => ";%{EMAILADDRESS:[email_ok]};" }
  }
}
output {
  elasticsearch {
    index => "isolated.%{[@metadata][_index]}"
    document_type => "uax_url_email"
    document_id => "%{[@metadata][_id]}"
  }
}

But the resulting "clean-email" field fails in some cases.

I found a way to define an analyzer and tokenizer that separate out email addresses, but I can't figure out how to apply them to the whole dataset. The definition is:

PUT isolated.filteredemails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email"
        }
      },
      "filter": [

This analyzer works on individual "message" values and tokenizes the email address correctly:

POST isolated.filteredemails/_analyze
{
  "analyzer": "my_analyzer",
  "text": "babylon;ny;11704;"
}

I don't know how to use this tokenizer to extract just the email part of "message" and store it in a new field called "clean-email".
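
For reference, this is the kind of post-processing I have in mind, sketched in Python. As far as I can tell, the uax_url_email tokenizer marks email tokens with the type "<EMAIL>" in the _analyze response, so keeping only those tokens would give me the value for "clean-email". The function name and the sample response are hand-written for illustration (the address is a placeholder, not real data or real output).

```python
def email_tokens(analyze_response):
    """Return the text of every token that the tokenizer typed as <EMAIL>."""
    return [
        tok["token"]
        for tok in analyze_response.get("tokens", [])
        if tok.get("type") == "<EMAIL>"
    ]


# Hand-written example of the _analyze response shape (not real output),
# for a message like "babylon;ny;11704;someone@example.com".
sample_response = {
    "tokens": [
        {"token": "babylon", "start_offset": 0, "end_offset": 7,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "ny", "start_offset": 8, "end_offset": 10,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "11704", "start_offset": 11, "end_offset": 16,
         "type": "<NUM>", "position": 2},
        {"token": "someone@example.com", "start_offset": 17, "end_offset": 36,
         "type": "<EMAIL>", "position": 3},
    ]
}

clean_email = email_tokens(sample_response)
```

What I can't work out is how to get the same effect inside the Logstash pipeline, without calling _analyze once per record.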

Thank you in advance to anyone who can give me a hand with this.
