Indexing around 140 million addresses - need some performance tips

I am about to begin a project to index 140 million documents with street
addresses, city, state, and zip. I will need to search against the
entire index every time a user types a letter in our search. I was
wondering if there is a way to organize street addresses in my index that is
smarter than just dumping them all into a single index and type. Maybe
a different type for each number? Or maybe there is a clever way of using
nested or parent/child objects/documents. Or maybe it's not necessary at
all and we can just rely on good use of filters. If that's the case, any
suggestions on how to filter this while doing searches?

Example of a street address (which is the field that we will be searching
against): 243 Broadway, New York, NY 10060

Example document:

{
street_address: "243 broadway",
street_number: 243,
street_name: "broadway",
city: "New York",
state: "NY",
zip: "10060"
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I think you should experiment.
What if you just search for a string within a string?

{
full_address: "243 Broadway, New York, NY 10060",
street_address: "243 broadway",
street_number: 243,
street_name: "broadway",
city: "New York",
state: "NY",
zip: "10060"
}
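
One way to read the suggestion above is to store a pre-joined full_address field next to the individual parts, so a single field can be searched. Below is a minimal Python sketch, assuming the field names from the example document. The helper name with_full_address is hypothetical and not from the thread.

```python
def with_full_address(doc):
    """Return a copy of doc with a combined full_address field added."""
    # Join the parts in the usual US order: street, city, state zip.
    full = "{}, {}, {} {}".format(
        doc["street_address"], doc["city"], doc["state"], doc["zip"]
    )
    enriched = dict(doc)  # shallow copy so the input document is not mutated
    enriched["full_address"] = full
    return enriched


doc = {
    "street_address": "243 broadway",
    "street_number": 243,
    "street_name": "broadway",
    "city": "New York",
    "state": "NY",
    "zip": "10060",
}
print(with_full_address(doc)["full_address"])
```

Whether the combined field helps depends on how it is analyzed at index time, which is exactly the part that still needs experimenting.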

On Wednesday, September 4, 2013 3:30:30 PM UTC-4, Anthony Campagna wrote:

I am about to begin a project to index 140 million documents with street
addresses, city, state, and zip. I will need to search against the
entire index every time a user types a letter in our search. I was
wondering if there is a way to organize street addresses in my index that is
smarter than just dumping them all into a single index and type. Maybe
a different type for each number? Or maybe there is a clever way of using
nested or parent/child objects/documents. Or maybe it's not necessary at
all and we can just rely on good use of filters. If that's the case, any
suggestions on how to filter this while doing searches?

Example of a street address (which is the field that we will be searching
against): 243 Broadway, New York, NY 10060

Example document:

{
street_address: "243 broadway",
street_number: 243,
street_name: "broadway",
city: "New York",
state: "NY",
zip: "10060"
}

What exactly do you mean by searching for a string within a string? A fuzzy
search? If so, I was under the impression that fuzzy searches are
incredibly slow and resource intensive compared to other methods of
searching.

I do plan on doing plenty of experimenting, but I have two barriers to that:

  1. There are just so many options available to me; I was curious what
    others think or have found to be successful
  2. I'm not exactly sure how to quantify how resource-intensive/taxing a
    single query is on an elasticsearch cluster

On Wednesday, September 4, 2013 3:48:29 PM UTC-4, Max Seleznev wrote:

I think you should experiment.
What if you just search for a string within a string?

{
full_address: "243 Broadway, New York, NY 10060",
street_address: "243 broadway",
street_number: 243,
street_name: "broadway",
city: "New York",
state: "NY",
zip: "10060"
}

No, I did not mean fuzzy search.

What schema are you using? Which types and analyzers?
Can you show the full schema?

On Wednesday, September 4, 2013 4:04:27 PM UTC-4, Anthony Campagna wrote:

What exactly do you mean by searching for a string within a string? A
fuzzy search? If so, I was under the impression that fuzzy searches are
incredibly slow and resource intensive compared to other methods of
searching.

I do plan on doing plenty of experimenting, but I have two barriers to that:

  1. There are just so many options available to me; I was curious what
    others think or have found to be successful
  2. I'm not exactly sure how to quantify how resource-intensive/taxing a
    single query is on an elasticsearch cluster

I have not gotten that far yet. The whole point of this thread is to see
what others suggest or have done. The schema and analyzers will depend
entirely on which route I decide to take, which might be heavily influenced
by this thread.

On Wednesday, September 4, 2013 4:13:56 PM UTC-4, Max Seleznev wrote:

No, I did not mean fuzzy search.

What schema are you using? Which types and analyzers?
Can you show the full schema?

One thing you should consider is how to handle synonyms in the street name.

So for example:

123 North Main Street could be equivalent to
123 N Main Street
123 N Main St
123 North Main St

500 Second Street
500 2nd St
500 2nd Street

Also consider stripping out commas and other delimiters if you store the
entire address as one field, since some users will search with commas and
some will search without them.
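
The two points above (street-name synonyms and delimiter stripping) can be sketched together as one normalization step applied to both the indexed address and the user's query. This is only an illustration in Python: the ABBREVIATIONS table is a small, incomplete assumption, and the function name normalize_address is hypothetical.

```python
import re

# Illustrative, incomplete equivalence table (an assumption, not from the thread).
ABBREVIATIONS = {
    "north": "n", "south": "s", "east": "e", "west": "w",
    "street": "st", "avenue": "ave",
    "first": "1st", "second": "2nd", "third": "3rd",
}


def normalize_address(text):
    """Lowercase, drop delimiters, and map each token to one canonical form."""
    # Replace delimiters with spaces so "Broadway,New York" still splits cleanly.
    cleaned = re.sub(r"[,.;]", " ", text.lower())
    tokens = cleaned.split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


variants = [
    "123 North Main Street",
    "123 N Main Street",
    "123 N Main St",
    "123 North Main St.",
]
# All four spellings collapse to the same normalized string.
print({normalize_address(v) for v in variants})
```

A plain token map like this cannot tell "St" (street) from "St" (saint), so a real table would need more care. Elasticsearch also has a synonym token filter that can apply this kind of mapping at analysis time instead of in application code.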

Author and Instructor for the Upcoming Book and Lecture Series
Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software

http://massivelogdata.com

Yeah, I'm not 100% sure how I'm going to handle synonyms yet, but it's on my
radar.

On Wednesday, September 4, 2013 4:33:14 PM UTC-4, Israel Ekpo wrote:

One thing you should consider is how to handle synonyms in the street name.

So for example:

123 North Main Street could be equivalent to
123 N Main Street
123 N Main St
123 North Main St

500 Second Street
500 2nd St
500 2nd Street

Also consider stripping out commas and other delimiters if you store the
entire address as one field, since some users will search with commas and
some will search without them.

Street name - Wikipedia

Yes, you're right.

Google Maps can do all of this.

I searched for "second ave New York, NY" and Google Maps showed me the correct address:

Did you mean:
2nd Ave, New York, NY 10022
https://maps.google.com/maps?q=2nd+Ave,+New+York,+10022&hl=en&sll=40.758816,-73.974703&sspn=0.007054,0.015814&hnear=Second+Ave,+New+York&t=m&ie=UTF8&oi=georefine&ct=clnk&cd=2&geocode=FSDubQIdUTyX-w&split=0

Then your search results will not be as good.
You need to use synonyms.

On Wednesday, September 4, 2013 4:45:12 PM UTC-4, Anthony Campagna wrote:

Yeah, I'm not 100% sure how I'm going to handle synonyms yet, but it's on my
radar.

I know. It's on my radar. I'm just not 100% sure how I'm going to handle it
yet. Having synonyms for N, S, E, W, NE, NW, SE, SW is fine. But creating a
synonym dictionary for every ordinal number up to 200th is a lot of work.
There might be a better way of handling numbers than a dictionary.
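
One possible alternative to hand-writing the number dictionary is to generate it. The Python sketch below builds the "second" to "2nd" style pairs for 1 through 200 programmatically. The function names and the choice of hyphenated spellings such as "twenty-first" are assumptions for illustration, not something from the thread.

```python
CARDINALS = ["", "one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine"]
ONES = ["", "first", "second", "third", "fourth", "fifth", "sixth",
        "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
        "thirteenth", "fourteenth", "fifteenth", "sixteenth",
        "seventeenth", "eighteenth", "nineteenth"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
TENS_ORD = ["", "", "twentieth", "thirtieth", "fortieth", "fiftieth",
            "sixtieth", "seventieth", "eightieth", "ninetieth"]


def ordinal_digits(n):
    """Return the numeric ordinal, e.g. 2 -> '2nd', 11 -> '11th'."""
    if 11 <= n % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def ordinal_word(n):
    """Return the spelled-out ordinal for 1 <= n <= 999, e.g. 21 -> 'twenty-first'."""
    if not 1 <= n <= 999:
        raise ValueError("only 1..999 is supported in this sketch")
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(CARDINALS[hundreds] + (" hundredth" if rest == 0 else " hundred"))
    if 0 < rest < 20:
        parts.append(ONES[rest])
    elif rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS_ORD[tens] if ones == 0 else TENS[tens] + "-" + ONES[ones])
    return " ".join(parts)


# Spelled-out form -> numeric form, for every ordinal up to 200th.
ORDINAL_SYNONYMS = {ordinal_word(n): ordinal_digits(n) for n in range(1, 201)}
print(ORDINAL_SYNONYMS["second"], ORDINAL_SYNONYMS["twenty-first"])
```

The generated table could then feed whatever synonym mechanism is chosen, or be applied to the query text before it is sent to the index.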

On Wednesday, September 4, 2013 4:47:21 PM UTC-4, Max Seleznev wrote:

Then your search results will not be as good.
You need to use synonyms.

If you can share more information about the dataset and how you plan to use
the data, you may get better recommendations.

If you plan to do local search that is limited to specific geographical
areas, it might be helpful to have the geo point type added as one of the
fields.

This could simplify the process of narrowing down searches to a specific
radius across multiple states, if necessary.

Are all these addresses within the United States? Or do you have other
countries and territories?

How are they distributed in terms of number of documents per state?

Your responses to these questions will influence how the architecture is
designed and configured.

I am working with just a small sample of a dataset that I am negotiating a
purchase for. Assume the dataset has just what I posted in my original post
plus lat/lon information. The dataset has 90%+ of all addresses in the
United States, and only within the United States. The number of documents per
state is unknown at this time, but it will ultimately be fairly similar to
the distribution of population across the states. While we will have
geopoints for all addresses, they will only be used for scoring. The
address search must be done across the entire dataset. If I type in "100 8th
st", it should give me the 10 closest results if there are more than 10;
if there are fewer than 10, it must give me as many addresses
as there are in the index, regardless of location.
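
The ranking rule described above (distance affects order only, never whether an address matches) can be sketched in plain Python. This assumes each text match carries lat and lon fields. The sample coordinates and the function names are made up for illustration, and Elasticsearch's own geo-distance sorting would likely replace this client-side code in practice.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_matches(matches, user_lat, user_lon, limit=10):
    """Order text matches by distance from the user and keep at most `limit`.

    Distance is used only for ordering: if fewer than `limit` addresses
    matched, all of them are returned no matter how far away they are.
    """
    ranked = sorted(
        matches,
        key=lambda m: haversine_km(user_lat, user_lon, m["lat"], m["lon"]),
    )
    return ranked[:limit]


# Hypothetical matches for "100 8th st" with approximate, made-up coordinates.
matches = [
    {"street_address": "100 8th st", "city": "San Francisco", "lat": 37.77, "lon": -122.41},
    {"street_address": "100 8th st", "city": "Washington", "lat": 38.90, "lon": -77.02},
    {"street_address": "100 8th st", "city": "New York", "lat": 40.73, "lon": -73.99},
]
# A user located roughly in Philadelphia.
for m in closest_matches(matches, 39.95, -75.17):
    print(m["city"])
```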

On Wednesday, September 4, 2013 4:54:07 PM UTC-4, Israel Ekpo wrote:

If you can share more information about the dataset and how you plan to
use the data, you may get better recommendations.

If you plan to do local search that is limited to specific geographical
areas, it might be helpful to have the geo point type added as one of the
fields.

This could simplify the process of narrowing down searches to a specific
radius across multiple states, if necessary.

Are all these addresses within the United States? Or do you have other
countries and territories?

How are they distributed in terms of number of documents per state?

Your responses to these questions will influence how the architecture is
designed and configured.
