Hi Michael
I can think of 3 basic ways of modeling this in ES and wanted to get
some feedback on if there were other/better paths and on the best way
forward. Some key assumptions/considerations for the system are:
1. Authoritative data will be kept in an alternate data source
(currently MySQL) and ES is only responsible for searching and
returning matching identities
2. BusinessType Type: New types will be very rare, updates to
types will be very rare
3. Business Type: New businesses will be relatively rare (5+ a
week), updates will also likely be rare (5+ a week)
4. Business Location Type: New locations will be less rare (20+
a week), updates will be rare (5+ a week)
5. Volume: Initial configuration should support 10K businesses
with 500K locations total
6. Search Volume: We're looking to support instant style
searching based on either the typing of additional terms or
scrolling of a map, so the search volume will be inflated by
frequent updates. Supporting 30 searches/second max with 10
searches/second average is our current goal, along with 100-200ms
response times.
7. We're planning to cloud host initially (either Rackspace or
AWS)
- One Flat Denormalized Type
  - Create a BusinessLocation type in an ES index that contains 1
    entry for each location. Information found in the Business and
    BusinessType parent types would be repeated/denormalized.
- Parent / Child in ES BusinessType->Business->BusinessLocation
  - Create 3 types in ES (BusinessType, Business, BusinessLocation)
    and model parent/child relationships
- Nested Types in ES BusinessType->Business->BusinessLocation
  - Create 3 types in ES (BusinessType, Business, BusinessLocation)
    and nested document relationships
Some important things to note about how Elasticsearch works:
- The fastest search is against a single document - nested or
  parent-child searches have to perform an extra step.
- When you do a nested or parent-child search, you have no way
  of knowing which of the sub-documents matched. You just know
  that at least one did match.
- You can't sort against a field with multiple values.
For instance, if your business has multiple locations, you can't do
a search like:
- find businesses of type X
- within 10 km of Lat,Lon
- sort by distance from Lat,Lon
because all ES knows is that Business 123 matched the search. It
doesn't know which location within Business 123 matched, so it
doesn't know which location to use for sorting.
So you have a few choices, depending on how you want to return your
results.
First, Business Type should just be a field in your Business doc. You
could even have multiple business types in a single field. This field is
just being used as a filter, not to sort by, so you could have:
business_type: ['taxi','limousine','truck_hire']
Then, you need to decide if you are going to incorporate "distance from
point Lat,Lon" into your sort parameters.
If no, then the simplest thing would be to have one doc per business:
{
    name: "Joe Bloggs Transport",
    business_type: ['taxi','limousine','truck_hire'],
    main_tel: "0123 456 789",
    main_email: "joe@bloggs.com",
    main_address: "1 Main Street, Mysteryville",
    locations: [[41,-71],[42,-70]]
}
Or perhaps each location needs its own address and contact details:
{
    name: "Joe Bloggs Transport",
    business_type: ['taxi','limousine','truck_hire'],
    main_tel: "0123 456 789",
    main_email: "joe@bloggs.com",
    main_address: "1 Main Street, Mysteryville",
    locations: [
        {
            email: "main_branch@bloggs.com",
            address: "1 Main Street, Mysteryville",
            tel: "0123 456 789",
            location: [41,-71],
        },
        {
            email: "sub_branch@bloggs.com",
            address: "1 Other Street, Enigmaville",
            tel: "0234 567 891",
            location: [42,-70],
        }
    ]
}
If you want these locations to be searchable independently (i.e. match
"other street" and "enigmaville", but not "other street" and
"mysteryville"), then these locations should be marked as 'nested'.
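To make the 'nested' option a bit more concrete, here is a rough sketch of a mapping fragment and a matching nested query, written as plain Python dicts. The field names come from the example doc above; the helper name nested_address_query and the choice of match queries are assumptions for illustration, and the exact mapping syntax may differ between ES versions.

```python
# Sketch only: field names follow the example doc above; the helper name
# and the use of "match" queries are assumptions, not part of the original.

# Mapping fragment: declaring "locations" as nested makes ES index each
# location as its own hidden sub-document.
locations_mapping = {
    "properties": {
        "locations": {
            "type": "nested",
            "properties": {
                "location": {"type": "geo_point"},
            },
        }
    }
}


def nested_address_query(*terms):
    """Build a nested query in which every term must match the address
    of the SAME location, not just any location of the business."""
    return {
        "nested": {
            "path": "locations",
            "query": {
                "bool": {
                    "must": [{"match": {"locations.address": t}} for t in terms]
                }
            },
        }
    }


# Matches the Enigmaville branch only.
query = nested_address_query("other street", "enigmaville")
```

Without the nested mapping, the same two terms could be satisfied by two different locations of one business, which is exactly the "other street" plus "mysteryville" false match described above.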
Alternatively, if you want to include the "distance from Lat,Lon" in the
sorting, then you need to know which location matches. In other words,
the doc you search on should have a single location only.
Then you need to ask yourself: if two locations for the same business
are returned in the search results, then do I want to show both, or just
show the single business?
In ES, we don't have "field collapsing", in other words you can't say
"if we have two docs with the same business ID, then collapse them into
a single result". So if you want to combine the two matching locations
into a single result, then you will need to do this client side.
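As a minimal sketch of that client-side step, assuming each hit is a location doc whose _source carries a business_id field (the field name and the function name collapse_by_business are hypothetical):

```python
def collapse_by_business(hits):
    """Group location hits by business_id.

    Businesses are returned in the order their first location appeared,
    so the ranking of the original search results is preserved.
    """
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for hit in hits:
        source = hit["_source"]
        grouped.setdefault(source["business_id"], []).append(source)
    return [
        {"business_id": business_id, "locations": locations}
        for business_id, locations in grouped.items()
    ]


# Hypothetical hits, already sorted by ES (e.g. by distance).
hits = [
    {"_source": {"business_id": 123, "address": "1 Main Street, Mysteryville"}},
    {"_source": {"business_id": 456, "address": "9 High Street, Mysteryville"}},
    {"_source": {"business_id": 123, "address": "1 Other Street, Enigmaville"}},
]
results = collapse_by_business(hits)
```

Note that collapsing after the fact means a page of 10 hits may shrink to fewer than 10 businesses, so you may want to request more hits than you plan to display.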
Again, the simplest thing would be to have a single doc type:
Business Location, which contains all the info for each business plus
the location.
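A search against such a flat Business Location doc could then look roughly like the sketch below, which builds the request body as a Python dict. It assumes each doc has a business_type field that is indexed without analysis and a single geo_point field called location; the function name is hypothetical, and the bool/filter wrapper is the form used by recent ES versions (older versions wrap filters differently). The point is given as an explicit lat/lon object because ES reads a bare array as [lon, lat].

```python
def location_search(business_type, lat, lon, radius_km=10):
    """Build a search body: filter by type and radius, sort by distance."""
    point = {"lat": lat, "lon": lon}
    return {
        "query": {
            "bool": {
                "filter": [
                    # find businesses of type X
                    {"term": {"business_type": business_type}},
                    # within radius_km of (lat, lon)
                    {
                        "geo_distance": {
                            "distance": f"{radius_km}km",
                            "location": point,
                        }
                    },
                ]
            }
        },
        # sort by distance from (lat, lon); this works because each doc
        # holds exactly one location
        "sort": [
            {"_geo_distance": {"location": point, "order": "asc", "unit": "km"}}
        ],
    }


body = location_search("taxi", 41.0, -71.0)
```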
However, if your Business object has lots of information which isn't
searchable, and so you don't want to repeat it for every location, then
you could index your Business separately from the multiple Business
Location docs, and include in your Location docs everything that needs
to be searchable (plus a business_id).
Then, once you get your location results, you can look up the matching
Business docs in a second step (e.g. using mget).
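Preparing that second step might look something like this sketch. It only builds the mget request body and leaves the actual HTTP call to whatever client you use; the function names are hypothetical, and it assumes the Business docs are indexed with the business_id as their _id and that the index is named in the request URL.

```python
def unique_business_ids(hits):
    """Collect business_ids from location hits, de-duplicated,
    in order of first appearance."""
    return list(dict.fromkeys(hit["_source"]["business_id"] for hit in hits))


def mget_body(business_ids):
    """Build the body for a multi-get request. The short "ids" form
    relies on the index being given in the request URL."""
    return {"ids": business_ids}


# Hypothetical location hits from the first search.
location_hits = [
    {"_source": {"business_id": 123, "address": "1 Main Street, Mysteryville"}},
    {"_source": {"business_id": 456, "address": "9 High Street, Mysteryville"}},
    {"_source": {"business_id": 123, "address": "1 Other Street, Enigmaville"}},
]
request_body = mget_body(unique_business_ids(location_hits))
```

De-duplicating the IDs first means each Business doc is fetched only once, even when several of its locations matched.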
hope this makes sense
clint