Define indexes? ES good solution?


(Johan) #1

Hi,

I'm new to search and doing some research, looking at either ES or Solr.

Let's say I want to create an address database to verify addresses with
street name, city, zipcode, etc., worldwide.

Each of these fields needs to be searchable. For example, you can
search on a zipcode to bring up all matching cities, or vice versa.

I also want to separate the countries so I don't need to search through
the US database when looking up a UK address. This concept is easy to
grasp in MySQL, etc., by defining country=US ... but in ES, will the US be
one index? Or will street, city, zipcode + US each be an index of
their own?

How many indexes do you need, and how do you count them? And can you
separate databases like this to avoid unnecessary searching?

If the database before import has 50 million rows, how many records
will that end up as in ES if each row contains, say, street name,
city, and zipcode?

Do you think ES could search 50 million records/rows in under 0.5
seconds? Maybe in a cloud environment? Or do you need a really big
cluster for that?

Any input appreciated.


(David Pilato) #2

Welcome !

You are not in the SQL world anymore. Just forget what you know about SQL searches.

That said, you can now ask yourself: why should I separate countries?
If country is a field of your address document, just add country=US when you search for US addresses.
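To make that concrete, here is a minimal sketch of such a request body. The field names (`street`, `city`, `zipcode`, `country`) are assumptions, and it uses a bool query with a term filter, one common way to express a filtered search:

```python
import json

def address_query(text, country):
    """Search across the address fields, restricted to one country.
    Field names here are illustrative, not from the original post."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["street", "city", "zipcode"],
                        }
                    }
                ],
                # the term filter scopes the search to a single country
                "filter": [{"term": {"country": country}}],
            }
        }
    }

print(json.dumps(address_query("Baker Street", "UK"), indent=2))
```

The filter clause does the same job as `WHERE country = 'US'` in SQL, without needing a separate index per country.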

If you have 50m rows and each row turns into one document, you will have 50m docs in ES.
You can easily create 1m or 2m docs on your laptop and see how much disk space they take...

BTW, you can spread your docs across many indexes if you need to.
You can also use routing: http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
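As a sketch of those two layouts: either the index name carries the country, or a routing value keeps each country's documents together on one shard so a country-scoped search visits fewer shards. The index names, field names, and bulk-action shape below are illustrative assumptions:

```python
def index_for(country_code):
    """Index-per-country layout: 'addresses-us', 'addresses-uk', ..."""
    return "addresses-" + country_code.lower()

def bulk_index_action(doc):
    """Shared-index layout: route by country so all docs for one
    country land on the same shard."""
    return {
        "index": {
            "_index": "addresses",
            "routing": doc["country"].lower(),
        }
    }

doc = {"street": "221B Baker Street", "city": "London",
       "zipcode": "NW1 6XE", "country": "UK"}
print(index_for(doc["country"]))    # addresses-uk
print(bulk_index_action(doc))
```

With routing, a search that passes the same routing value only has to hit the shard holding that country's documents.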

I don't think that you will need a huge cluster to hold 50m simple docs as you describe them.

I would say that it is so easy to get started with ES that you can try a few million docs on your laptop and see where it goes.

HTH
David
@dadoonet



(Johan) #3

Thanks for the reply.

My initial thought was that separating countries would make search
faster.

Considering the USA is really huge, maybe 40 million records, while a
smaller country might be at most 1 million, searching the non-US
lists should be faster than having to go through the same database.

But maybe it does not work this way, and having everything in the same
database won't slow it down?

The address verification is always country specific.



(Shay Banon) #4

You can definitely separate into an index per country, and it will be faster to
search. But it won't be by much if you use filters to filter on the country,
thanks to how filters work and the fact that they are nicely cached. It's
really up to you.
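The two options can be sketched side by side: scoping by index name versus scoping with a cached term filter. The index and field names here are hypothetical; each helper returns an (index, request body) pair:

```python
def search_per_country_index(country, zipcode):
    """Option 1: the index name itself scopes the search to one country."""
    return ("addresses-" + country.lower(),
            {"query": {"match": {"zipcode": zipcode}}})

def search_with_filter(country, zipcode):
    """Option 2: one big index; a cached term filter scopes the search."""
    return ("addresses", {
        "query": {
            "bool": {
                "must": [{"match": {"zipcode": zipcode}}],
                "filter": [{"term": {"country": country}}],
            }
        }
    })

print(search_per_country_index("UK", "NW1 6XE")[0])   # addresses-uk
print(search_with_filter("UK", "NW1 6XE")[0])         # addresses
```

Because the country filter result is cached after the first use, option 2 usually performs close to option 1 while keeping operations simpler (one index to manage instead of one per country).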

