I am reviewing ES to replace our current setup of homegrown indexing built
with Riak + Riak Search. Our current setup is not exactly an epic fail, but it is
definitely not giving us what we need. And of course, I am trying to build
this while a current live application already exists, so it had to be did
My major requirements are:
- Modeling a one-to-many relationship. Think of it as Books + Paragraphs.
Books have attributes such as Author, PublishTime (year+month), Title, and Amazon
link. Paragraphs belong to a book and have words, a numerical score, and start
and end positions within the book.
Book (Author, Date, Title, Url, Content)
|--> Paragraph (Words, Score, StartPosition, EndPosition)
This is not the true model but a very close analogy.
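To make the two modeling options concrete, here is a rough sketch of how I picture the mappings, written as Python dicts. The field names are hypothetical and come from the analogy above. The type names and the `_parent` setting follow the older ES mapping syntax as I understand it from the reference material, so treat this as an assumption rather than a tested mapping.

```python
import json

# Hypothetical field names from the Book/Paragraph analogy.
# Type names follow the older ES mapping syntax (an assumption on my part).
BOOK_FIELDS = {
    "author": {"type": "string"},
    "publish_time": {"type": "date", "format": "yyyy-MM"},
    "title": {"type": "string"},
    "url": {"type": "string"},
    "content": {"type": "string"},
}
PARAGRAPH_FIELDS = {
    "words": {"type": "string"},
    "score": {"type": "double"},
    "start_position": {"type": "long"},
    "end_position": {"type": "long"},
}

# Option A: paragraphs stored inside each book as nested documents.
nested_mapping = {
    "book": {
        "properties": {
            **BOOK_FIELDS,
            "paragraphs": {"type": "nested", "properties": PARAGRAPH_FIELDS},
        }
    }
}

# Option B: paragraphs as separate child documents pointing at a parent book.
parent_child_mapping = {
    "book": {"properties": BOOK_FIELDS},
    "paragraph": {"_parent": {"type": "book"}, "properties": PARAGRAPH_FIELDS},
}

print(json.dumps(nested_mapping, indent=2))
print(json.dumps(parent_child_mapping, indent=2))
```

With option A, updating one paragraph means reindexing the whole book; with option B, paragraphs can be indexed independently of their book.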
- Performing computations (such as the average score for words) based on date,
with the result grouped by book. So essentially, it's aggregating the score
from the paragraphs and grouping by book. Also finding "other" words that
are significant based on the current date and word query.
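To pin down what I mean by the computation, here is a minimal plain-Python sketch of the result I want, with no ES involved. The sample data, the field names, and the `average_score_by_book` helper are all made up for illustration; this is the app-level version I am hoping to push down into ES.

```python
from collections import defaultdict

# Hypothetical sample data following the Book/Paragraph analogy.
books = {
    "b1": {"title": "Book One", "publish_time": "2012-03"},
    "b2": {"title": "Book Two", "publish_time": "2012-07"},
    "b3": {"title": "Book Three", "publish_time": "2011-11"},
}
paragraphs = [
    {"book": "b1", "words": ["whale", "sea"], "score": 2.0},
    {"book": "b1", "words": ["whale", "ship"], "score": 4.0},
    {"book": "b1", "words": ["harbor"], "score": 9.0},
    {"book": "b2", "words": ["whale"], "score": 5.0},
    {"book": "b3", "words": ["whale"], "score": 7.0},
]


def average_score_by_book(word, start, end):
    """Average paragraph score per book, for paragraphs containing `word`
    in books published between `start` and `end` (inclusive, "YYYY-MM")."""
    totals = defaultdict(lambda: [0.0, 0])  # book id -> [score sum, count]
    for p in paragraphs:
        published = books[p["book"]]["publish_time"]
        # "YYYY-MM" strings compare correctly in lexicographic order.
        if start <= published <= end and word in p["words"]:
            totals[p["book"]][0] += p["score"]
            totals[p["book"]][1] += 1
    return {book: s / n for book, (s, n) in totals.items()}


result = average_score_by_book("whale", "2012-01", "2012-12")
print(result)  # {'b1': 3.0, 'b2': 5.0}
```

Note that the date filter is on the book while the score and word filter are on the paragraph, which is exactly the "search by parent, aggregate by child" shape.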
- Scale: Currently there are millions of 'book' objects, which could grow
to billions or more. We may need to report on 10 million book objects at
one time, which could involve aggregating 100 million paragraph objects
using some computation. I am hoping to use statistical facets and/or
scripting to move this to ES instead of transporting it to the app level
and doing aggregations using Java lists.
My main questions:
- Do you think ES is a good fit for search+compute? Will the statistical
facet work for my requirements?
- Should I model the data as Parent/Child or Nested Documents? I need to
search by Parent, but aggregate by Child. Based on my reading of the
reference material and this forum, that is not possible using the
Parent/Child schema since the only thing available is the has_child filter,
which does not allow you to do computations on the children, just filter by
them.
- Will ES be able to handle this scale?
- Are the computations I need to perform possible using the statistical
facets and/or scripting?
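For reference, these are the two request shapes I have been looking at, sketched as Python dicts. The `has_child` filter and the `terms_stats` facet are as I understand them from the reference material, so the exact syntax is an assumption and may depend on the ES version. The `words`, `score`, and `book_id` fields are hypothetical; in particular, `book_id` assumes I copy the book's id onto each paragraph document.

```python
import json

# Parent-side search: return books that have at least one matching paragraph.
# This only filters parents; it does not compute anything over the children.
has_child_search = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "has_child": {
                    "type": "paragraph",
                    "query": {"term": {"words": "whale"}},
                }
            },
        }
    }
}

# Child-side search: run against paragraph documents and compute score
# statistics per book, keyed on the (hypothetical) denormalized book_id.
score_by_book_search = {
    "query": {"term": {"words": "whale"}},
    "facets": {
        "score_by_book": {
            "terms_stats": {
                "key_field": "book_id",
                "value_field": "score",
            }
        }
    },
}

print(json.dumps(has_child_search, indent=2))
print(json.dumps(score_by_book_search, indent=2))
```

The first body gives me the right books but no numbers, and the second gives me numbers but cannot filter on book attributes such as PublishTime unless those are also copied onto each paragraph. That gap is the core of my question.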
Please do get back to me quickly, so I know I am making the right design
decisions from the get go.