On Wed, Mar 17, 2010 at 6:22 PM, Sergio Bossa <sergio.bossa@gmail.com> wrote:
On Wed, Mar 17, 2010 at 11:16 PM, Eric Gaumer <egaumer@gmail.com> wrote:
Of course, the tight integration
between terrastore and elasticsearch invalidates some of the concerns.
Thanks for sharing your thoughts, Eric.
Do you mind elaborating more on that?
In terms of enterprise search, roughly 80% of the project time is spent on
document ingest. You've got to aggregate content from disparate sources like
relational databases, content management systems, mail servers, file
servers, file systems, web servers, web services, etc. You're typically
talking about hundreds of millions of documents ranging in all sorts of
formats.
Organizations spend millions of dollars trying to leverage search to
"unify" their data architecture, and it's difficult, expensive, and tends to
lead to fragile one-off solutions that are a nightmare to maintain. To make
matters worse, they want their enterprise development teams to be able to
build applications against the search platform. In doing so they want to
index complete documents to avoid having to make an additional network call
out to the legacy system containing the actual resource.
The problem with this scenario is that data is (typically) quite volatile.
When you rely on getting complete documents straight from a search index,
you end up with tight coupling to the resource. When I do a Google search
for "linux" I might get back a result pointing to kernel.org. If
kernel.org makes changes to the site (i.e., the resource), my result
(reference) still points to the latest version. This is a core principle of
REST.
When an enterprise organization insists on building applications against
fully indexed documents (i.e., the source), they suffer from synchronization
problems at the presentation layer. Changes on the original data source are
often not reflected in the application. When they realize this (or you make
them realize it) the most common response is "real-time indexing". It's very
difficult to achieve this even when the search engine supports it. Why?
Because you're dealing with large volumes of data that span the globe in
some cases and it's all held together by these fragile ingest architectures.
The end result is lots of unhappy folks, from stakeholders to managers, to
engineers, to end users.
So to elaborate on my original comment, when you can tightly integrate
search as a layer of the data storage "stack", you get this relatively
seamless synchronization between the resource and the references in the
index. When a user updates a document, the storage system ensures the index
is also updated to reflect the changes. From what I've read, this is exactly
the relationship between terrastore and elasticsearch.
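That write-through relationship can be sketched roughly like this. All class
names here are hypothetical illustrations of the pattern, not the actual
terrastore or elasticsearch APIs:

```python
# Sketch of search integrated as a layer of the storage stack: every
# write to the store also updates the index, so references never go stale.
# DocumentStore and SearchIndex are invented names for illustration only.

class SearchIndex:
    """A toy inverted index mapping terms to document ids."""
    def __init__(self):
        self.terms = {}

    def index(self, doc_id, text):
        # Drop stale postings for this document before re-indexing it.
        for ids in self.terms.values():
            ids.discard(doc_id)
        for term in text.lower().split():
            self.terms.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return self.terms.get(term.lower(), set())


class DocumentStore:
    """Storage layer that keeps the index in sync on every write."""
    def __init__(self, search_index):
        self.docs = {}
        self.search_index = search_index

    def put(self, doc_id, text):
        self.docs[doc_id] = text
        # The storage system, not the application, updates the index,
        # so the resource and its references can't drift apart.
        self.search_index.index(doc_id, text)

    def get(self, doc_id):
        return self.docs[doc_id]


store = DocumentStore(SearchIndex())
store.put("doc1", "linux kernel news")
store.put("doc1", "bsd kernel news")  # update: the index follows automatically
assert "doc1" not in store.search_index.search("linux")
assert "doc1" in store.search_index.search("bsd")
```

The point of the sketch is only where the indexing call lives: inside the
storage layer's write path, rather than in a separate ingest pipeline that
the application has to keep synchronized by hand.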
I've built search architectures for Comcast, IBM, Disney, Financial Times,
Dow Jones, S&P, Associated Press, Thomson/Reuters, and Citigroup, just to
name a few. This type of integration addresses a huge need, and that's what
really interests me most about elasticsearch (the schema-free nature and the
elasticity).
The only problem (and this has nothing to do with elasticsearch) is that
these legacy systems aren't going away anytime soon. We'll be dealing with
poorly implemented enterprise data architectures for years to come. The
bright side is that new startups can be built around these new ideas and
pave the way for more intelligent data architectures.
Regards,
-Eric