Modeling Businesses & Locations in ES


(Michael Sick) #1

Hi All,

I'm looking for some data modeling advice. I'm helping to build a
location-based search for businesses that basically breaks down into:
BusinessType (id, description, name)
Business (id, description, name, array[alias], business_type_id)
BusinessLocation (id, location[lat,lon], state, zip, area_code)

The goal is to provide a location-based search that filters by distance from
a location[lat,lon]. The location information will be used along with
standard text search (on descriptions, names, aliases) on analyzed terms and
boolean yes/no matching on unanalyzed terms (BusinessType(id),
Business(id), zip, area_code).

I can think of 3 basic ways of modeling this in ES and wanted to get some
feedback on if there were other/better paths and on the best way forward.
Some key assumptions/considerations for the system are:

  1. Authoritative data will be kept in an alternate data source
    (currently MySQL) and ES is only responsible for searching and returning
    matching identities
  2. BusinessType Type: New types will be very rare, updates to types will
    be very rare
  3. Business Type: New businesses will be relatively rare (5+ a week),
    updates will also likely be rare (5+ a week)
  4. Business Location Type: New locations will be less rare (20+ a
    week), updates will be rare (5+ a week)
  5. Volume: Initial configuration should support 10K businesses with 500K
    locations total
  6. Search Volume: We're looking to support instant style searching based
    on either the typing of additional terms or scrolling of a map, so the
    search volume will be inflated by frequent re-queries. Supporting 30
    searches/second max with 10 searches/second average is our current goal,
    along with 100-200ms response times.
  7. We're planning to cloud host initially (either Rackspace or AWS)

The three options I'm considering:

  1. One Flat Denormalized Type
  • Create a BusinessLocation type in an ES index that contains 1 entry for
    each location. Information found in the Business and BusinessType parent
    types would be repeated/denormalized.
  2. Parent / Child in ES BusinessType->Business->BusinessLocation
  • Create 3 types in ES (BusinessType, Business, BusinessLocation) and
    model parent/child relationships
  3. Nested Types in ES BusinessType->Business->BusinessLocation
  • Create 3 types in ES (BusinessType, Business, BusinessLocation) and
    model nested document relationships

Advice and experience would be greatly appreciated.

Thanks!!,

--Mike


(Clinton Gormley) #2

Hi Michael

Some important things to note about how ElasticSearch works:

  1. The fastest search is against a single document - nested or
    parent-child searches have to perform an extra step.

  2. When you do a nested or parent-child search, you have no way
    of knowing which of the sub-documents matched. You just know
    that at least one did match.

  3. You can't sort against a field with multiple values.

    For instance, if your business has multiple locations, you can't do
    a search like:

    • find businesses of type X
    • within 10 km of Lat,Lon
    • sort by distance from Lat,Lon

    because all ES knows is that Business 123 matched the search. It
    doesn't know which location within Business 123 matched, so it
    doesn't know which location to use for sorting.
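
To make the sorting problem concrete, here is a small Python sketch (plain
arithmetic, not Elasticsearch code) that computes the distance from a search
point to each location of one business. The coordinates reuse the sample
values from this thread; the `haversine_km` helper and the search point are
assumptions made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One business document holding two locations (hypothetical sample data).
business = {"id": 123, "locations": [(41.0, -71.0), (42.0, -70.0)]}
search_point = (41.5, -71.0)

# One candidate sort key per location: the document alone does not say
# which of these distances should rank Business 123.
distances = [
    haversine_km(*search_point, lat, lon)
    for lat, lon in business["locations"]
]
```

With one location per document there is exactly one distance per hit, so
the ambiguity disappears.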

So you have a few choices, depending on how you want to return your
results.

First, Business Type should just be a field in your Business doc. You
could even have multiple business types in a single field. This field is
just being used as a filter, not to sort by, so you could have:

business_type: ['taxi','limousine','truck_hire']

Then, you need to decide if you are going to incorporate "distance from
point Lat,Lon" into your sort parameters.

If not, then the simplest thing would be to have one doc per business:

{
  "name": "Joe Bloggs Transport",
  "business_type": ["taxi", "limousine", "truck_hire"],
  "main_tel": "0123 456 789",
  "main_email": "joe@bloggs.com",
  "main_address": "1 Main Street, Mysteryville",
  "locations": [{"lat": 41, "lon": -71}, {"lat": 42, "lon": -70}]
}

Or perhaps each location needs its own address and contact details:
{
  "name": "Joe Bloggs Transport",
  "business_type": ["taxi", "limousine", "truck_hire"],
  "main_tel": "0123 456 789",
  "main_email": "joe@bloggs.com",
  "main_address": "1 Main Street, Mysteryville",
  "locations": [
    {
      "email": "main_branch@bloggs.com",
      "address": "1 Main Street, Mysteryville",
      "tel": "0123 456 789",
      "location": {"lat": 41, "lon": -71}
    },
    {
      "email": "sub_branch@bloggs.com",
      "address": "1 Other Street, Enigmaville",
      "tel": "0234 567 891",
      "location": {"lat": 42, "lon": -70}
    }
  ]
}

If you want these locations to be searchable independently (i.e. match
"other street" and "enigmaville", but not "other street" and
"mysteryville"), then these locations should be marked as 'nested'.

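The cross-matching problem can be illustrated without a cluster. The Python
sketch below imitates how an array of plain (non-nested) objects ends up as
one bag of values per field, so terms from different locations can satisfy
the same query. The `flatten`, `matches_flat` and `matches_nested` helpers
are hypothetical stand-ins, not Elasticsearch APIs, and the matching is a
crude substring test rather than real text analysis.

```python
def flatten(locations):
    """Imitate non-nested indexing: merge each field's values across objects."""
    merged = {}
    for loc in locations:
        for field, value in loc.items():
            merged.setdefault(field, []).append(value.lower())
    return merged


def matches_flat(locations, street, town):
    """Both terms only need to appear somewhere in the merged address values."""
    addresses = flatten(locations).get("address", [])
    return (any(street in a for a in addresses)
            and any(town in a for a in addresses))


def matches_nested(locations, street, town):
    """Both terms must appear in the address of the same location object."""
    return any(
        street in loc["address"].lower() and town in loc["address"].lower()
        for loc in locations
    )


locations = [
    {"address": "1 Main Street, Mysteryville"},
    {"address": "1 Other Street, Enigmaville"},
]

# Street from one branch, town from the other.
flat_hit = matches_flat(locations, "other street", "mysteryville")
nested_hit = matches_nested(locations, "other street", "mysteryville")
```
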
Alternatively, if you want to include the "distance from Lat,Lon" in the
sorting, then you need to know which location matches. In other words,
the doc you search on should have a single location only.
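
As a sketch of the search this enables, the Python function below builds a
request body that matches text, filters on business type, restricts hits to
a radius, and sorts by distance. It assumes a recent Elasticsearch query DSL
(`bool` with `filter`, the `geo_distance` query, and the `_geo_distance`
sort), which differs from the syntax available when this thread was written.
The field names `description`, `business_type` and `location` (mapped as
`geo_point`) and the `build_search_body` helper are assumptions.

```python
import json


def build_search_body(text, business_types, lat, lon, radius_km=10):
    """Build a search body for one-location-per-document indexes."""
    point = {"lat": lat, "lon": lon}
    return {
        "query": {
            "bool": {
                # Scored full-text part of the search.
                "must": [{"match": {"description": text}}],
                # Unscored yes/no restrictions.
                "filter": [
                    {"terms": {"business_type": business_types}},
                    {"geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": point,
                    }},
                ],
            }
        },
        # Nearest location first.
        "sort": [
            {"_geo_distance": {"location": point, "order": "asc", "unit": "km"}}
        ],
    }


body = build_search_body("airport transfer", ["taxi"], 41.0, -71.0)
print(json.dumps(body, indent=2))
```

The body is a plain dictionary, so it can be sent with whichever HTTP or
Elasticsearch client the application already uses.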

Then you need to ask yourself: if two locations for the same business
are returned in the search results, then do I want to show both, or just
show the single business?

In ES, we don't have "field collapsing", in other words you can't say
"if we have two docs with the same business ID, then collapse them into
a single result". So if you want to combine the two matching locations
into a single result, then you will need to do this client side.
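
A minimal sketch of that client-side step in Python, assuming the hits arrive
already sorted by distance and that each hit's `_source` carries a
`business_id` field (a hypothetical field name). It keeps the first, i.e.
nearest, location per business and preserves the result order.

```python
def collapse_by_business(hits):
    """Keep only the first hit for each business_id, preserving order."""
    seen = set()
    collapsed = []
    for hit in hits:
        business_id = hit["_source"]["business_id"]
        if business_id not in seen:
            seen.add(business_id)
            collapsed.append(hit)
    return collapsed


# Hypothetical hits, nearest first, shaped like entries of a search response.
hits = [
    {"_id": "loc-1", "_source": {"business_id": 123}, "sort": [1.2]},
    {"_id": "loc-7", "_source": {"business_id": 456}, "sort": [3.4]},
    {"_id": "loc-2", "_source": {"business_id": 123}, "sort": [8.9]},
]
collapsed = collapse_by_business(hits)
```

Collapsing after the fact can leave a page short of results, so the client
may need to request more hits than it intends to display.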

Again, the simplest thing would be to have a single doc type:
Business Location, which contains all the info for each business plus
the location.
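
A sketch of how such flat documents might be assembled from the source rows
before indexing. The field names follow the schema in the first post; the
row shapes, the `business_id` foreign key on each location row, and the
`build_location_docs` helper are assumptions, since the thread does not show
the actual MySQL export.

```python
def build_location_docs(business_types, businesses, locations):
    """Return one flat, searchable document per business location."""
    types_by_id = {t["id"]: t for t in business_types}
    businesses_by_id = {b["id"]: b for b in businesses}

    docs = []
    for loc in locations:
        business = businesses_by_id[loc["business_id"]]  # assumed foreign key
        business_type = types_by_id[business["business_type_id"]]
        docs.append({
            "id": loc["id"],
            "location": {"lat": loc["lat"], "lon": loc["lon"]},
            "state": loc["state"],
            "zip": loc["zip"],
            "area_code": loc["area_code"],
            # Parent fields repeated on every location document.
            "business_id": business["id"],
            "business_name": business["name"],
            "business_description": business["description"],
            "aliases": business["aliases"],
            "business_type_id": business_type["id"],
            "business_type_name": business_type["name"],
        })
    return docs


# Hypothetical sample rows with placeholder values.
business_types = [{"id": 1, "name": "taxi", "description": "Taxi services"}]
businesses = [{
    "id": 123, "name": "Joe Bloggs Transport", "description": "Cars and vans",
    "aliases": ["Bloggs Cabs"], "business_type_id": 1,
}]
locations = [
    {"id": 1, "business_id": 123, "lat": 41.0, "lon": -71.0,
     "state": "XX", "zip": "00000", "area_code": "000"},
    {"id": 2, "business_id": 123, "lat": 42.0, "lon": -70.0,
     "state": "XX", "zip": "00000", "area_code": "000"},
]
docs = build_location_docs(business_types, businesses, locations)
```

An update to a business or business type then means reindexing every
location document that repeats it, which the low update rates described in
the first post should make tolerable.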

However, if your Business object has lots of information which isn't
searchable, and so you don't want to repeat it for every location, then
you could index your Business separately from the multiple Business
Location docs, and include in your Location docs everything that needs
to be searchable (plus a business_id).

Then, once you get your location results, you can look up the matching
Business docs in a second step (e.g. using mget).
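
The second step could look like the Python sketch below, which collects the
distinct `business_id` values from the location hits and builds a multi-get
request body. The `ids` form of the body assumes the index is named in the
request URL; the hit shape and the `mget_body_for` helper are assumptions
for illustration.

```python
def mget_body_for(hits):
    """Build an _mget body listing each business id once, in first-seen order."""
    ids = []
    for hit in hits:
        business_id = str(hit["_source"]["business_id"])
        if business_id not in ids:
            ids.append(business_id)
    return {"ids": ids}


# Hypothetical location hits from the first search step.
location_hits = [
    {"_id": "loc-1", "_source": {"business_id": 123}},
    {"_id": "loc-7", "_source": {"business_id": 456}},
    {"_id": "loc-2", "_source": {"business_id": 123}},
]
mget_body = mget_body_for(location_hits)
```

Since ES is only expected to return matching identities here, the same list
of ids could equally be used to fetch the full records from MySQL.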

Hope this makes sense :)

clint


(Michael Sick) #3

Hi Clinton,

Thank you for the review. I prototyped the system by keeping the document
flat and it seems to work well so far. We're writing sample queries now and
we'll see how it performs.

Some important things to note about how ElasticSearch works:

  1. The fastest search is against a single document - nested or
    parent-child searches have to perform an extra step.

I wasn't aware of the internals but it makes sense.

  2. When you do a nested or parent-child search, you have no way
    of knowing which of the sub-documents matched. You just know
    that at least one did match.

That would not work without parsing through the results - which could get
complex.

Then, you need to decide if you are going to incorporate "distance from
point Lat,Lon" into your sort parameters.

We will sort by geo.

Makes perfect sense - thank you for the review!

