Monitor millions of emails

Hello,

I want to store millions of email addresses in ES and check whether a given email matches one of them. The problem I'm facing is deciding which data model is best.

Solution 1

Extract the username and the email domain. This can be done in two ways:

N indexes -> 1 type (Solution 1A)

The index is the email domain and I store just the username.
When I want to check whether an email address is stored in ES, the search would be fast because it stays within a single index. If I have foobar@gmail.com, I will have to match 'foobar' in the 'gmail.com' index.
Two big problems:

  • I do not think ES scales well with a very large number of indexes
  • Space is wasted when an index holds only a few usernames

So I would assume this solution is really bad.
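
As an illustration of the 1A layout described above, here is a minimal sketch of how an address could be routed to a per-domain index. The function names and the "username" field are hypothetical and do not come from the thread. The domain is lowercased because Elasticsearch index names must be lowercase.

```python
from typing import Tuple


def split_email(address: str) -> Tuple[str, str]:
    """Split an address into (domain, username) for the 1A layout."""
    username, sep, domain = address.strip().rpartition("@")
    if not sep or not username or not domain:
        raise ValueError(f"not a valid email address: {address!r}")
    # Index names must be lowercase, so normalise the domain.
    return domain.lower(), username


def lookup_request(address: str) -> Tuple[str, dict]:
    """Return the index to search and a term query body for the username.

    The "username" field name is an assumption made for this sketch.
    """
    index, username = split_email(address)
    return index, {"query": {"term": {"username": username}}}
```

With this layout, every distinct domain creates its own index, which is where the two problems listed above come from.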

1 big index -> N types (Solution 1B)

The type is the email domain and, as in solution 1A, we store only the username.
Although there is less overhead than in the previous solution, it can lead to long search times, so it would only be good for "small" datasets.

What are the implications if I split the index into multiple indexes and use aliases?

Solution 2

Store the email address as is, using the standard tokeniser. Is this solution the same as 1B?
According to the ES documentation, a type is simply an additional field that ES applies a filter on.

What are your thoughts?

Thanks

Hi,
what would your query look like? That is the main question to answer when you need to decide how to structure your data.

I think that using many types is neither that useful nor optimal, given that everything is in the same index anyway. Solution 2 is probably better: you can always keep a "domain" field if needed and filter on it. As for scaling, you just have to choose the right number of shards for your queries and your documents.
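
To make the "domain" field idea concrete, here is a rough sketch of what the documents and a filtered query could look like. It only builds request bodies as Python dicts. The helper names are hypothetical, lowercasing the address is an assumption (it treats addresses as case-insensitive), and the bool/filter syntax assumes an Elasticsearch version that supports it (2.0 or later).

```python
def build_document(address):
    """Build one document holding the full address plus its domain."""
    # Assumption: addresses are compared case-insensitively.
    address = address.strip().lower()
    domain = address.rpartition("@")[2]
    return {"email": address, "domain": domain}


def build_query(address=None, domain=None):
    """Build a search body that filters on email and/or domain."""
    filters = []
    if address is not None:
        filters.append({"term": {"email": address.strip().lower()}})
    if domain is not None:
        filters.append({"term": {"domain": domain.strip().lower()}})
    if not filters:
        raise ValueError("give at least an address or a domain")
    # Filter clauses do not compute relevance scores, which suits yes/no matching.
    return {"query": {"bool": {"filter": filters}}}
```

For example, build_query(domain="gmail.com") restricts a search to one domain without needing a separate index or type per domain.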

Cheers
Luca

Hi Luca,

thanks for the reply.

The number of records could reach about 30M unique emails after six months.

Also the query will be:

{
	"query": {
		"term": {
			"email": "<email to search>"
		}
	}
}

I only need to retrieve a record and that is it.
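
One caveat with the query above: a term query looks up the exact term and does not analyse the search text, whereas the standard tokeniser would split an address such as foobar@gmail.com at the @ sign when indexing. For the exact lookup to match, the "email" field would need to be indexed as a single unanalysed term. A minimal sketch follows, assuming a version that has the "keyword" field type (5.0 or later; older versions would use a string field with "index": "not_analyzed"). The helper names are hypothetical, and only the "properties" fragment of the mapping is shown because its enclosing structure differs between versions.

```python
def email_mapping():
    """Mapping fragment that indexes each address as one exact term."""
    # Assumption: the cluster supports the "keyword" type (ES 5.0+).
    return {"properties": {"email": {"type": "keyword"}}}


def exact_email_search(address):
    """Search body for the lookup; one hit is enough to confirm a match."""
    return {
        "size": 1,
        "query": {"term": {"email": address}},
    }
```

Since only one record is needed, "size": 1 keeps the response small.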

Thanks

Davide

OK, so it seems you don't need to group by domain or use multiple types; one index should be enough.
