Monitor millions of emails

netcelli.tux · November 15, 2016, 11:45am

Hello,

I want to store millions of email addresses in ES and check whether a given email matches one of them. The problem I'm facing is choosing the best data model.

Solution 1

Extract the username and email domain. This can be done in 2 ways:

N indexes -> 1 type (Solution 1A)

The index is the email domain and I store just the username.
When I want to check whether an email address is stored in ES, the search only has to run within a single index, so it should be fast. If I have foobar@gmail.com, I will have to match 'foobar' in the 'gmail.com' index.
Two big problems:

I do not think ES scales well with many indexes
Space is wasted when an index is too small

So I would assume this solution is really bad.
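To make the routing in Solution 1A concrete, here is a minimal sketch of the extraction step. The helper names `split_email` and `lookup_target_1a` are hypothetical, and lowercasing the domain is an assumption (Elasticsearch index names must be lowercase). It only computes where a lookup would go; it does not talk to ES.

```python
def split_email(address):
    """Split 'user@domain' into (username, domain), lowercasing the domain."""
    # rpartition splits on the LAST '@', so the domain is everything after it.
    username, sep, domain = address.strip().rpartition("@")
    if not sep or not username or not domain:
        raise ValueError("not a valid email address: %r" % address)
    return username, domain.lower()


def lookup_target_1a(address):
    """Solution 1A: the domain names the index, the username is the stored value."""
    username, domain = split_email(address)
    return {"index": domain, "username": username}
```

With this, foobar@gmail.com maps to the username 'foobar' in the index 'gmail.com', so every distinct domain would create its own index.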

1 big index -> N types (Solution 1B)

The type is the email domain and like solution 1A we store the username only.
Although there is less overhead than in the previous solution, it can lead to long search times, so it would only be good for "small" datasets.

What are the implications if I split the index into multiple indexes and use aliases?

Solution 2

Store the email address as is, using the standard tokeniser. Is this solution equivalent to 1B?
According to the ES documentation, the type is simply an additional field that ES filters on.

What are your thoughts?

Thanks

javanna · November 15, 2016, 12:14pm

Hi,
what would your query look like? That is the main question to answer when you need to decide how to structure your data.

I think that using many types is neither useful nor optimal, given that everything is in the same index anyway. Solution 2 is probably better: you can always keep a "domain" field if needed and filter on that. As for scaling, you just have to choose the proper number of shards given your queries and your documents.
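A minimal sketch of that single-index layout, assuming each document carries the full address in an "email" field plus a separate "domain" field. The helper names are hypothetical, and lowercasing the whole address is an assumption about how you want to compare addresses. The functions only build request bodies as plain dicts.

```python
def build_document(address):
    """Document for the single index: full address plus its domain."""
    email = address.strip().lower()
    # Everything after the last '@' is treated as the domain.
    domain = email.rpartition("@")[2]
    return {"email": email, "domain": domain}


def domain_filter_query(domain):
    """Search body that restricts results to one domain via a bool filter."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"domain": domain.strip().lower()}}]
            }
        }
    }
```

The filter on "domain" does the job that a per-domain type or index would have done, without multiplying types or indexes.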

Cheers
Luca

netcelli.tux · November 15, 2016, 12:42pm

Hi Luca,

thanks for the reply.

The number of records could reach about 30M unique emails after six months.

Also the query will be:

{
	"query": {
		"term": {
			"email": "<email to search>"
		}
	}
}

I only need to retrieve a record and that is it.
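One thing to watch with this query: a term query is not analyzed, so it only matches when the indexed term is the whole address. A minimal sketch, assuming the "email" field is mapped as "keyword" (available from Elasticsearch 5.0) and that addresses are lowercased both at index time and at query time. The names `EMAIL_PROPERTIES`, `normalize_email` and `term_query` are hypothetical.

```python
# Assumed mapping fragment for the "properties" section of the index mapping:
# "keyword" indexes the value as one unanalyzed term.
EMAIL_PROPERTIES = {"email": {"type": "keyword"}}


def normalize_email(address):
    """Apply the same normalization before indexing and before searching."""
    return address.strip().lower()


def term_query(address):
    """Search body for an exact lookup of one address."""
    return {"query": {"term": {"email": normalize_email(address)}}}
```

If the field were analyzed with the standard tokeniser instead, the address would be split into several terms and a term query on the full address would not find it.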

Thanks

Davide

javanna · November 15, 2016, 1:53pm

OK, it seems you don't need to group by domain or use multiple types then; one index should be enough.

system · December 13, 2016, 1:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
