hey all,
this is my first post, i hope i don't sound lazy; i've done a lot of
searching and a lot of experimenting with ES trying to find an
answer. i'm coming from minimal use with Endeca and loving ES so far
(awesome product!).
i'm trying to come up with the best way to organize/index and search
my data model, for which an Author Name, Book Title, & Chapter Title
model is perfectly analogous. (obviously, the types have other meta
data, but these text fields are most important right now).
i have roughly 600k authors, 1.5M books, and 13M chapters. primary
user search will be one text input box and i want to return all types
of results (authors, books, chapters) together, sorted by "most
relevant".
so searching "emily bronte wuthering" or "heights bronte" should
return the book "wuthering heights" by "emily bronte" first, and some
or all of its chapters and the author in the results too. or, an
example with the chapter title: "catherine becomes a lady wuthering
heights" yielding that chapter, then the book + author.
that is to say (different ordered permutations of author terms, book
terms, and chapter terms should yield the same results, btw):
phrases of: should yield:
author + book => that book first, its author and chapters hits as
well plus any other close matches.
author + chapter => that chapter first, its owning book and its
author hits as well plus any other close matches.
book + chapter => that chapter first, its owning book and its author
hits as well plus any other close matches.
chapter => that chapter first, its owning book and its author hits
as well plus any other close matches.
i hope i'm making sense.
i can think of three ways to index:
1.) nested: author contains books contains chapters
2.) parent/child: author parent of book, book parent of chapter (can
you have multiple level ancestry?)
3.) completely flat, with a lot of data duplication (every chapter
type stores author name and book title, every book type stores author
name [and possibly an array of chapter titles?], and the author type
has nothing else [or a list of book titles, and a list of chapter
titles?])
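
to make option 3 concrete, here's a rough sketch of what i mean by flattening with duplication. the field names (`type`, `author_name`, `book_title`, `chapter_title`) and the `flatten` helper are made up for illustration, not anything ES requires:

```python
def flatten(author):
    """Turn one nested author record (option 1 shape) into flat,
    denormalized docs (option 3 shape). Field names are hypothetical."""
    name = author["name"]
    # the author doc carries only its own name in this variant
    docs = [{"type": "author", "author_name": name}]
    for book in author.get("books", []):
        # every book doc repeats the author name
        docs.append({
            "type": "book",
            "author_name": name,
            "book_title": book["title"],
        })
        for chapter in book.get("chapters", []):
            # every chapter doc repeats both author name and book title
            docs.append({
                "type": "chapter",
                "author_name": name,
                "book_title": book["title"],
                "chapter_title": chapter["title"],
            })
    return docs


# option 1 (nested) shape for a single author
nested = {
    "name": "emily bronte",
    "books": [
        {
            "title": "wuthering heights",
            "chapters": [{"title": "catherine becomes a lady"}],
        }
    ],
}

flat = flatten(nested)
for doc in flat:
    print(doc)
```

with this shape a query like "catherine becomes a lady wuthering heights" can match all of its terms inside the single chapter doc, which is the behavior i'm after. the cost is the duplicated author and book text on every chapter.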
which of these, if any, should i go with? i've started messing around
with an attempt at #1 because it seems cleanest and i don't care about
having to re-index the entire doc, as long as it will perform fast
enough and i can bubble up all 3 result types (not just author…),
ideally with highlighting too… honestly, i'm not totally sure how to
approach the other scenarios. (i did something similar to #3 with
Endeca, but it never worked well at all).
hopefully i'm lucky enough to get an answer on this (would be hugely
appreciated), but i've also started thinking about the query… i'm
guessing i'll want to multi-field all of these, with edgeNGram
analyzers (so i can auto complete / suggest) and i think shingles as
well (seemed like a good idea since the user will be supplying
fragments of, or the entire phrase for, each type). (furthermore: i'm
noticing "emily bronte wuthering heights" is ranking books about the
book i want (e.g. "a closer look at emily bronte's wuthering heights"
by "john doe") over the book i'm after ("wuthering heights" by "emily
bronte").
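
for reference, this is roughly the kind of analysis setup and query i have in mind, written as plain python dicts. it's only a sketch: the analyzer names, the sub-field names (`.autocomplete`, `.shingles`) and the boost values are placeholders i made up, and the filter type names may differ by ES version (older releases spell the edge n-gram filter `edgeNGram`):

```python
import json

# hypothetical index settings: one edge n-gram analyzer for
# autocomplete and one shingle analyzer for phrase fragments
settings = {
    "analysis": {
        "filter": {
            "autocomplete_filter": {
                "type": "edge_ngram",  # "edgeNGram" on older versions
                "min_gram": 2,
                "max_gram": 15,
            },
            "shingle_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 3,
            },
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "autocomplete_filter"],
            },
            "shingles": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "shingle_filter"],
            },
        },
    }
}

TEXT_FIELDS = ["author_name", "book_title", "chapter_title"]


def build_query(text):
    """Build a multi_match body over each text field and its
    (assumed) shingle sub-field. Boost values are guesses to tune."""
    fields = []
    for name in TEXT_FIELDS:
        fields.append(name)
        fields.append(name + ".shingles^2")
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = build_query("emily bronte wuthering heights")
print(json.dumps(body, indent=2))
```

i'd keep the `.autocomplete` sub-field for the suggest box only, since edge n-grams in the main query would probably make the "a closer look at..." problem worse.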
thank you so much for your help!
-j