Author, Book, Chapter analogy - which schema / use case: Nested, Parent / Child, Flat or...?


(Johnny-2) #1

hey all,

this is my first post, i hope i don't sound lazy; i've done a lot of
searching and a lot of experimenting with ES trying to find an
answer. i'm coming from minimal use with Endeca and loving ES so far
(awesome product!).

i'm trying to come up with the best way to organize/index and search
my data model, for which an Author Name, Book Title, & Chapter Title
model is perfectly analogous. (obviously, the types have other meta
data, but these text fields are most important right now).

i have roughly 600k authors, 1.5M books, and 13M chapters. primary
user search will be one text input box and i want to return all types
of results (authors, books, chapters) together, sorted by "most
relevant".

so searching "emily bronte wuthering" or "heights bronte" should
return the book "wuthering heights" by "emily bronte" first, and some
or all of its chapters and the author in the results too. or, an
example with the chapter title: "catherine becomes a lady wuthering
heights" yielding that chapter, then the book + author.

that is to say (different ordered permutations of author terms, book
terms, and chapter terms should yield the same results, btw):

phrases of: should yield:
author + book => that book first, its author and chapter hits as
well, plus any other close matches.
author + chapter => that chapter first, its owning book and its
author hits as well, plus any other close matches.
book + chapter => that chapter first, its owning book and its author
hits as well, plus any other close matches.
chapter => that chapter first, its owning book and its author hits
as well, plus any other close matches.

i hope i'm making sense.

i can think of three ways to index:
1.) nested: author contains books contains chapters
2.) parent/child: author parent of book, book parent of chapter (can
you have multiple level ancestry?)
3.) completely flat, with a lot of data duplication (every chapter
type stores author name and book title, every book type stores author
name[and possibly an array of chapter titles?], and the author type
has nothing else [or a list of book titles, and a list of chapter
titles?])
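To make option 3 concrete, here is a minimal sketch of flattening an author/book/chapter tree into one flat document per entity. The field names (`doc_type`, `author_name`, `book_title`, `chapter_title`) and the input shape are assumptions made for illustration, not anything Elasticsearch prescribes.

```python
from typing import Dict, Iterator, List


def flatten_catalog(authors: List[dict]) -> Iterator[Dict[str, str]]:
    """Yield one flat document per author, book, and chapter.

    Each book repeats its author's name, and each chapter repeats both the
    author's name and the book title, so a single document can match a query
    that mixes terms from several levels.
    """
    for author in authors:
        yield {"doc_type": "author", "author_name": author["name"]}
        for book in author.get("books", []):
            yield {
                "doc_type": "book",
                "author_name": author["name"],
                "book_title": book["title"],
            }
            for chapter_title in book.get("chapters", []):
                yield {
                    "doc_type": "chapter",
                    "author_name": author["name"],
                    "book_title": book["title"],
                    "chapter_title": chapter_title,
                }


# Hypothetical input shape: authors own books, books own chapter titles.
catalog = [
    {
        "name": "emily bronte",
        "books": [
            {"title": "wuthering heights", "chapters": ["catherine becomes a lady"]}
        ],
    }
]
flat_docs = list(flatten_catalog(catalog))
```

With this layout the chapter document alone carries the terms for a query like "catherine becomes a lady wuthering heights", at the cost of repeating the author and book strings on every chapter.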

which of these, if any, should i go with? i've started messing around
with an attempt at #1 because it seems cleanest and i don't care about
having to re-index the entire doc, as long as it will perform fast
enough and i can bubble up all 3 result types (not just author…),
ideally with highlighting too… honestly, i'm not totally sure how to
approach the other scenarios. (i did something similar to #3 with
Endeca, but it never worked well at all).

hopefully i'm lucky enough to get an answer on this (would be hugely
appreciated), but i've also started thinking about the query… i'm
guessing i'll want to multi-field all of these, with edgeNGram
analyzers (so i can auto complete / suggest) and i think shingles as
well (seemed like a good idea since the user will be supplying
fragments of, or the entire phrase for, each type). furthermore, i'm
noticing "emily bronte wuthering heights" is ranking books about the
book i want (e.g. "a closer look at emily bronte's wuthering heights"
by "john doe") over the book i'm after ("wuthering heights" by "emily
bronte").
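For readers unfamiliar with the two techniques mentioned here, the sketch below shows in plain Python what edge n-grams and shingles are. It only illustrates the idea; it does not reproduce the exact output or options of Elasticsearch's own token filters, and the function names and defaults are made up for this example.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10) -> List[str]:
    """Return the prefixes of `token` from min_gram up to max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def shingles(tokens: List[str], size: int = 2) -> List[str]:
    """Return every run of `size` adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


prefixes = edge_ngrams("bronte", min_gram=2, max_gram=4)
pairs = shingles("emily bronte wuthering heights".split(), size=2)
```

Indexing prefixes lets a partially typed word such as "bro" match while the user is still typing, and indexing word pairs rewards documents where the query words appear next to each other.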

thank you so much for your help!

-j


(Jörg Prante) #2

Hi,

What you are trying to do is, generally speaking, nothing but building a
library catalog. Incidentally, my occupation is building catalogs for
academic libraries with ES, with millions of titles written by
thousands of authors, for hundreds of libraries, in dozens of
languages, with many different publication types, issues, catalog
enrichments, and so on.

My approach would be creating a single index for your data, e.g.
"works". In a "works" index, create an index type "titles", where you
put information about the title words, like "wuthering heights",
together with creator names, like "emily bronte" and so on. In another
index type "books", you can put specific book information, for
example, publisher names, publisher places, dates of publication,
media types, extents. And finally, in a third type "chapter", you put
all the chapter information, the chapter content, the chapter pages,
possibly headlines. If you have a proper data model, you can put a
unique "works" ID in all the index types. Finally, do some index
boosting. Lifting up the "title" index type will help because title
words are common words and can also be frequently found in the
chapters. Most use cases require hits on title words on top of the
result list, just because it is so common to search for title words.
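As a rough illustration of how a shared work ID and a per-type boost fit together, the sketch below does the combination on the client side in plain Python. It is not the engine's own boosting mechanism, and the hit layout (`work_id`, `type`, `score`) and the boost values are assumptions. Note also that recent Elasticsearch versions no longer allow several mapping types in one index, so there `type` would simply be an ordinary field.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-type boosts; "titles" is lifted as suggested above.
TYPE_BOOST = {"titles": 3.0, "books": 1.5, "chapter": 1.0}


def group_by_work(hits: List[dict]) -> List[dict]:
    """Group hits by work_id and rank works by their best boosted score."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for hit in hits:
        boosted = hit["score"] * TYPE_BOOST.get(hit["type"], 1.0)
        grouped[hit["work_id"]].append({**hit, "boosted_score": boosted})
    works = [
        {
            "work_id": work_id,
            "best_score": max(h["boosted_score"] for h in members),
            "hits": sorted(members, key=lambda h: h["boosted_score"], reverse=True),
        }
        for work_id, members in grouped.items()
    ]
    return sorted(works, key=lambda w: w["best_score"], reverse=True)


# Made-up hits from the three types, all carrying the shared work ID.
sample_hits = [
    {"work_id": "w1", "type": "chapter", "score": 2.0},
    {"work_id": "w2", "type": "titles", "score": 1.0},
    {"work_id": "w1", "type": "books", "score": 1.0},
]
ranked_works = group_by_work(sample_hits)
```

Because every hit carries the work ID, the application can show the title, book, and chapter hits of one work together without storing them in a single nested document.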

In your application, you can decide whether to search for titles,
books, chapters together or separately, just by selecting or combining
the index type(s). In fact, if you have unique identifiers for your
works, you need not have much data duplication in your index. Data
duplication in a search engine need not be a bad thing per se if
you know how your data is organized, i.e. how you manage data
updates in an efficient way. For example, depending on the ID of the
work, you can easily update the whole data, parts of it in the index
type, or just add more index types. Besides, you may think about
arbitrary navigation over the information about a work by linking or
faceting.

In the beginning, I wouldn't care too much about the ES nesting thing.
It looks attractive but it could be a bit hard to instrument ES to do
all the data modeling for you. As you already noticed, the index
update thing, the managing of field contents, the weighting etc. is
simpler with index types and fields in a linear order.

Jörg

On Jan 12, 7:05 am, Johnny johnnymarn...@gmail.com wrote:



(Ævar Arnfjörð Bjarmason) #3

On Fri, Jan 13, 2012 at 09:59, jprante joergprante@gmail.com wrote:

My approach would be creating a single index for your data, e.g.
"works". In a "works" index, create an index type "titles", where you
put information about the title words, like "wuthering heights",
together with creator names, like "emily bronte" and so on. In another
index type "books", you can put specific book information, for
example, publisher names, publisher places, dates of publication,
media types, extents. And finally, in a third type "chapter", you put
all the chapter information, the chapter content, the chapter pages,
possibly headlines. If you have a proper data model, you can put a
unique "works" ID in all the index types. Finally, do some index
boosting. Lifting up the "title" index type will help because title
words are common words and can also be frequently found in the
chapters. Most use cases require hits on title words on top of the
result list, just because it is so common to search for title words.

In your application, you can decide whether to search for titles,
books, chapters together or separately, just by selecting or combining
the index type(s). In fact, if you have unique identifiers for your
works, you need not have much data duplication in your index. Data
duplication in a search engine need not be a bad thing per se if
you know how your data is organized, i.e. how you manage data
updates in an efficient way. For example, depending on the ID of the
work, you can easily update the whole data, parts of it in the index
type, or just add more index types. Besides, you may think about
arbitrary navigation over the information about a work by linking or
faceting.

I'd be curious about more specifics about how you go about
constructing a query that combines the index types.

If I'm understanding you correctly, this only works because in your
example you can put unique "works" IDs everywhere; then you get a
bunch of results and display all the works that matched either the
author, the chapter contents, etc.

I have a nested data issue that I've become convinced I have to solve
with duplication, but maybe I'm wrong. What I'm doing is implementing a
geographic search engine where you can e.g. search for:

London

Which would match entries like:

{
    dest_type: "city",
    dest_id: 12345,
    country: "uk",
    name_en: "London",
    name_it: "Londra"
},
{
    dest_type: "city",
    dest_id: 54321,
    country: "us",
    name_en: "London",
}

I.e. it would both match the London in the UK that we all know, but
also obscure cities in the US.

Now if your search query is:

London, United Kingdom

Or:

King's Cross, London

It should be able to figure out that the UK result is more pertinent
because it's in a country matching "United Kingdom", or that the
"King's Cross" landmark is only in the UK London, not the one in the
US.

In the system I'm working on, this logic currently happens outside
of ES. The algorithm is:

  1. Hope that from the ES results we get both all the "London"
    cities, as well as a separate document for the country "United
    Kingdom", and one for the landmark of "King's Cross".

  2. Compare all these results based on their lat/lon. If a result in
    the set is geographically close to another result, boost its score.
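The two steps above can be sketched roughly as follows. This is an illustration of the described post-processing, not the actual system: the hit layout, the 50 km radius, and the boost factor are assumptions, and the coordinates are approximate values used only as sample data.

```python
import math
from typing import List


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def boost_nearby(hits: List[dict], radius_km: float = 50.0, boost: float = 2.0) -> List[dict]:
    """Multiply a hit's score by `boost` if any other hit lies within radius_km."""
    rescored = []
    for i, hit in enumerate(hits):
        has_neighbor = any(
            haversine_km(hit["lat"], hit["lon"], other["lat"], other["lon"]) <= radius_km
            for j, other in enumerate(hits)
            if j != i
        )
        score = hit["score"] * boost if has_neighbor else hit["score"]
        rescored.append({**hit, "score": score})
    return sorted(rescored, key=lambda h: h["score"], reverse=True)


# Approximate coordinates, for illustration only.
sample_hits = [
    {"name": "London (UK)", "lat": 51.5074, "lon": -0.1278, "score": 1.0},
    {"name": "London (US)", "lat": 37.1290, "lon": -84.0833, "score": 1.0},
    {"name": "King's Cross", "lat": 51.5308, "lon": -0.1238, "score": 1.0},
]
ranked = boost_nearby(sample_hits)
```

The pairwise comparison grows quadratically with the number of hits, and it can only boost a result whose neighbour actually made it into the result set.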

Now this sucks because it requires a lot of CPU intensive
post-processing of ES results, and relies on ES returning the entry
for "King's Cross" as well as the entry for "London", which it doesn't
always do (due to result set sizes etc.).

So I was hoping to structure my documents like this instead:

{
    dest_type: "city",
    dest_id: 12345,
    country: "uk",
    name_en: "London",
    name_it: "Londra",
    nearby_landmarks: [
        {
            dest_type: "landmark",
            dest_id: 111,
            name_en: "King's Cross",
        },
        {
            dest_type: "landmark",
            dest_id: 222,
            name_en: "Covent Garden",
        },
    ],
    regions: [
        {
            dest_type: "region",
            dest_id: 444,
            name_en: "Greater London",
        },
        {
            dest_type: "region",
            dest_id: 444,
            name_en: "England",
        },
    ],
    country: [
        {
            dest_type: "country",
            dest_id: 3,
            name_en: "United Kingdom",
        },
    ],
},

And then I'd simply search all the documents for:

name_en^10
name_it^10
country.name_en^5
country.name_it^5
nearby_landmarks.name_en^3
nearby_landmarks.name_it^3
regions.name_en^2
regions.name_it^2

I.e. "get me documents whose name matches the query, but also search
through the country names, nearby landmarks and regions associated
with those documents".
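One possible way to express the boosted field list as a query body is a `multi_match` query, sketched below as a plain Python dict. It assumes the inner arrays are mapped as ordinary objects (not as the `nested` type, which would need a `nested` query), and that the cluster is recent enough to support the `type` option; the helper name is made up.

```python
import json
from typing import List

# Field list and boosts copied from the list above.
BOOSTED_FIELDS: List[str] = [
    "name_en^10",
    "name_it^10",
    "country.name_en^5",
    "country.name_it^5",
    "nearby_landmarks.name_en^3",
    "nearby_landmarks.name_it^3",
    "regions.name_en^2",
    "regions.name_it^2",
]


def build_destination_query(text: str) -> dict:
    """Build a search body that scores `text` against every boosted field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": BOOSTED_FIELDS,
                # most_fields adds up the scores of all matching fields, so a
                # document matching both its own name and its country ranks higher.
                "type": "most_fields",
            }
        }
    }


body = build_destination_query("London, United Kingdom")
body_json = json.dumps(body, indent=2)
```

The resulting JSON would be sent as the body of a search request; which `type` works best for this data is something to test rather than assume.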

Of course this also means that I have to add the full details (well,
the text I need to search through) of other nearby documents to every
single document. E.g. for a landmark in London I'll be adding other
nearby landmarks, what region it's in, the country etc.
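A minimal sketch of that denormalization step at indexing time, assuming the relations between destinations are already known and kept in a simple lookup. The helper name, the `related` mapping, and the choice of copied keys are all hypothetical.

```python
from typing import Dict, List

# Only the parts of a related entry that need to be searchable are copied.
SEARCHABLE_KEYS = ("dest_type", "dest_id", "name_en")


def denormalize(dest: dict, by_id: Dict[int, dict], related: Dict[str, List[int]]) -> dict:
    """Return a copy of `dest` with the searchable parts of related entries embedded."""
    doc = dict(dest)
    for field, ids in related.items():
        doc[field] = [
            {key: by_id[i][key] for key in SEARCHABLE_KEYS if key in by_id[i]}
            for i in ids
        ]
    return doc


destinations = {
    12345: {"dest_type": "city", "dest_id": 12345, "name_en": "London"},
    111: {"dest_type": "landmark", "dest_id": 111, "name_en": "King's Cross"},
    3: {"dest_type": "country", "dest_id": 3, "name_en": "United Kingdom"},
}
london_doc = denormalize(
    destinations[12345],
    destinations,
    {"nearby_landmarks": [111], "country": [3]},
)
```

The source records stay normalized; only the indexed copy is inflated, so a change to one landmark means re-indexing every document that embeds it.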

This means that the dataset will be much bigger, but Lucene can then
grind through it using only full-text scoring and return documents that
are more pertinent to my query, instead of me having to get a larger
set of documents from ES and figure out the relations between them
myself.

Is there a better way to do this?

My mental model of Lucene is that it's a document store where each
document is composed of an arbitrary list of key/values. You can
search through all the documents with your search string, but you
can't easily do any sort of "join" type operation.

Thus structuring your data like I've done above would be similar to how
you'd structure a travel catalog. Each destination discussed in the
catalog is going to have a lot of duplicated info that other pages
also have (e.g. nearby landmarks), but since that info is there any
full-text search engine can return pertinent entries from the catalog.

The nearby places mentioned on each page of the catalog might also be
valid places of their own, so you can't represent it as a relationship
where one is a sub-field of the other.


(Jörg Prante) #4

Yes, the "work ID" is crucial.

Your search domain looks like it is specific to the area of
geographical names and location-based services. I suggest looking at
GeoNames. The service http://geonames.org is a free geographical
database with millions of entries from all over the world. The search
on their site is powered by Lucene.

The "nearby" function you described is sort of an approximate search
without geo coordinates and feels a bit heavy. Instead, you could add
the geonames database to your index, and do some nifty Elasticsearch
geo queries with filtering and sorting by geo distance. Just prepare
and use the latitude / longitude coordinates.
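As a hedged sketch of what such a request might look like, the Python below builds a search body with a `geo_distance` filter and a `_geo_distance` sort, using the query DSL of recent Elasticsearch versions (older releases spell the filter differently). It assumes each document has a field named `location` mapped as `geo_point`; that field name and the helper are assumptions.

```python
def build_geo_query(text: str, lat: float, lon: float, distance: str = "50km") -> dict:
    """Match `text` on name_en, keep hits within `distance`, sort nearest first."""
    origin = {"lat": lat, "lon": lon}
    return {
        "query": {
            "bool": {
                "must": {"match": {"name_en": text}},
                # The filter restricts hits without affecting the relevance score.
                "filter": {"geo_distance": {"distance": distance, "location": origin}},
            }
        },
        "sort": [
            {"_geo_distance": {"location": origin, "order": "asc", "unit": "km"}}
        ],
    }


# Approximate coordinates of central London, used only as sample input.
geo_body = build_geo_query("King's Cross", lat=51.5074, lon=-0.1278)
```

Sorting by distance replaces relevance ordering here; whether to sort by distance or to fold distance into the score is a design choice that depends on the use case.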

OK, Elasticsearch will now give you neat results, but what next? For
your travel catalog, you will require some identification markers or
classification of the things nearby, how they are grouped, how they
are connected, which are the "access points" you like to offer to your
users, how to cover interesting things important to your users and so
on.

Geonames is also part of Linked Open Data. Enter the semantic web. Now
this is real fun. Everything in the semantic web has a URI, and this is
exactly the kind of crucial ID I mean by "work IDs". URIs are also
globally unique by definition. You can refer to geo entities by URI,
and you can find relationships to and from this URI, and you can rely
on it. The database for doing this exists, just reuse (and improve)
it, the service is free and open. You might discover more resources on
the web, location-based services with relations to a reliable geonames
URI.

Those relationship types you discover could be integrated into your geo
index in a fashion that is equivalent to preparing Elasticsearch index
types, by the way. This could help managing the process of structuring
your data. But, as related data changes are more frequent than changes
to core data, index types are not always a clever choice. If you have
many updates or deletions, you could also think about indexes sitting
beside each other and perform round-robin index switching over time.
For Elasticsearch, updating and deleting indexes as a whole is more
efficient than walking through all the documents in the index types.
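One common way to switch between side-by-side indexes is an index alias that the application searches, repointed once a freshly built index is ready. Aliases are not mentioned above, so treat this as one possible implementation of the switching idea rather than the method described; the index naming scheme (`geo_v1`, `geo_v2`) and both helper names are assumptions.

```python
def next_generation(index_name: str) -> str:
    """Return the next index name, e.g. 'geo_v1' -> 'geo_v2'."""
    prefix, _, generation = index_name.rpartition("_v")
    return f"{prefix}_v{int(generation) + 1}"


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for the _aliases endpoint that repoints `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


current = "geo_v1"
fresh = next_generation(current)
swap_body = build_alias_swap("geo", current, fresh)
```

After the swap, the old index can be deleted as a whole, which is the cheap operation referred to above.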

Just my 2p: be very careful about thinking of Lucene / Elasticsearch
as a key/value database. It's a search engine, an inverted (indirect)
index (kind of value/key if you like). Just because Lucene offers many
fields per document, and Elasticsearch offers a very convenient JSON
view on your source data and scales easily, does not mean that search
engines do direct indexing automagically (by using hash or B-tree
key/value indexes). Thinking in relational terms, you can model your
data by putting keys in hierarchical or even network (graph)
relationships. But Lucene does not reflect such relationships between
keys; it is strong on full-text searching, and so is Elasticsearch.
The more structured the keys in your model are, the higher the price
you have to pay in search engines: large field count, large memory
consumption, long search times, complex queries, heavy sorting on keys
and so on. This is not the area where search engines shine. One
important lesson for putting structured keys into search engines and
still having decent resource usage is this: use as few fields as
possible, pack as much content - even redundant content - into as few
fields as possible, design a suitable ranking algorithm (but only if
Lucene's built-in one is not good enough for you) and just let the
machine perform the relevance ranking to get exactly the document
lists you want to display.
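To illustrate the "value/key" point and the advice to pack content into few fields, here is a toy inverted index in plain Python. It is a teaching sketch only: real Lucene analysis, storage, and scoring are far more involved, and the document layout and function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def pack_text(doc: dict, fields: Iterable[str]) -> str:
    """Concatenate the chosen fields of a document into one searchable string."""
    return " ".join(str(doc[f]) for f in fields if doc.get(f))


def build_inverted_index(docs: Dict[int, dict], fields: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each term (the value) to the ids of the documents containing it (the keys)."""
    fields = list(fields)
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in pack_text(doc, fields).lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every term of the query."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()


docs = {
    1: {"title": "Wuthering Heights", "creator": "Emily Bronte"},
    2: {"title": "Jane Eyre", "creator": "Charlotte Bronte"},
}
index = build_inverted_index(docs, ["title", "creator"])
```

Because title and creator are packed into one term space, a query such as "heights bronte" finds the right work regardless of which field each word came from or what order the words are typed in.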

Jörg

On Jan 16, 3:00 pm, Ævar Arnfjörð Bjarmason ava...@gmail.com wrote:


