Disabling _source field

Sergio_Bossa · March 17, 2010, 9:14am

Hi Shay,

weeks ago we talked about the possibility to disable the "_source"
field, in order to avoid storing a verbatim copy of the indexed
document inside the Lucene index: did you think about that?
What about adding such a feature in 0.6.0?
I may help with the implementation as well, if you'd want to.

Thanks,
Cheers,

Sergio B,

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

kimchy · March 17, 2010, 7:37pm

Hi Sergio,

I added the ability to disable the source field; see issue #66,
"Mapper: Ability to disable storing the "source" field" (elastic/elasticsearch on GitHub). But I
strongly believe that most of the time you would want to keep it enabled. Let me
explain why:

When searching, you usually want to display data as part of the hits. That
data can easily be extracted from the source field which is the json
document that was indexed (instead of picking and choosing specific fields
to be stored).

Even when elasticsearch is used with systems like Terrastore, which also
stores the json document, I believe that it makes sense to store the json in
the elasticsearch "source" field as well. The main reason is simply performance.
While you do pay in index size and indexing time, you can never fetch the
source faster than when you are already on the node that stores it
(colocation), not to mention that the search itself is already distributed. If all
that is returned from the search results are ids, then you need, for each
hit, to go and fetch the document from Terrastore, and you have just increased the
overhead of your search requests and of your system in general.

-shay.banon
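
For readers who want to see what such a setting might look like, here is a minimal sketch of a mapping body with the source field turned off. It assumes the `_source` / `enabled` mapping syntax of later Elasticsearch releases; the exact form shipped for issue #66 may differ. The field names `title` and `body` and the helper `build_mapping` are hypothetical.

```python
import json


def build_mapping(store_source: bool) -> dict:
    """Return a mapping body with the _source field enabled or disabled."""
    return {
        "mappings": {
            # Assumed syntax: "_source.enabled", as in later Elasticsearch releases.
            "_source": {"enabled": store_source},
            "properties": {
                # Hypothetical fields. "store": True keeps only this field
                # retrievable when the full source is not stored.
                "title": {"type": "text", "store": True},
                "body": {"type": "text"},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(store_source=False), indent=2))
```

With the source disabled, only fields explicitly marked as stored can be returned with a hit, which is the "picking and choosing specific fields" trade-off described above.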

Lukas_Vlcek1 · March 17, 2010, 8:28pm

What if I want to implement search on top of books, or simply
documents with a very large source? And I want to display just
fractions of the source on the result page for each relevant document...
then I would welcome some flexibility in telling ES how much of the
content (source) should be retrieved from the index. Maybe I am mixing this up
with the highlighting feature (which is probably already on the TODO list), but
still... I think it would be useful to have a way to tell ES how much of
the source should be returned (one option could be giving some XPath
expression to trim/filter the source, or the like)... does it sound like a
stupid idea?

Lukas

kimchy · March 17, 2010, 8:36pm

No, it's not at all. I think what you ask for is mostly covered by
highlighting (and, when searching, you can pass an empty array of fields, in
which case the source field will not be returned). With highlighting, you
will be able to get the interesting fragments of what you searched for (but you
would still need to store something to be able to highlight it...).

-shay.banon
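
A rough sketch of the request described above: a search body that passes an empty `fields` array and asks for highlighted fragments. The `fields` key follows the post. The `match` query and the `highlight` section use the query DSL of later Elasticsearch releases and are assumptions here, since highlighting was not yet available when this was written. Later releases also accept `"_source": false` in the search body for the same purpose. The helper name `build_search_body` is hypothetical.

```python
import json


def build_search_body(query_text: str, field: str) -> dict:
    """Build a search body that returns fragments instead of the full source."""
    return {
        # Assumed query syntax from later releases.
        "query": {"match": {field: query_text}},
        # Empty array of fields: per the post, the source is not returned.
        "fields": [],
        # Assumed highlight syntax; the field still has to be stored somewhere.
        "highlight": {"fields": {field: {}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("whale", "body"), indent=2))
```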

Sergio_Bossa · March 17, 2010, 9:53pm

On Wed, Mar 17, 2010 at 8:37 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

I added the ability to disable the source
field: Mapper: Ability to disable storing the "source" field · Issue #66 · elastic/elasticsearch · GitHub.

You rock

But, I
strongly believe that most of the times, you would want to enable it. Let me
explain why:
When searching, you usually want to display data as part of the hits. That
data can easily be extracted from the source field which is the json
document that was indexed (instead of picking and choosing specific fields
to be stored).
Even when elasticsearch is used with systems like Terrastore, which also
stores the json document, I believe that it makes sense to store the json in
elasticsearch "source" field as well. The main reason is simply performance.

I know the performance argument, but I think it's more important, when
you start to get more and more data, to have separated stores for
documents and indexes: this will help maintain the Lucene index as
lightweight as possible, and have a unique access point for documents
(be it Terrastore, Cassandra or whatever).
The performance penalty caused by the different network hits will be
IMHO paid off by the higher throughput of having two different
distributed entities independently deployed and independently working.
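
To make the "different network hits" concrete, here is a toy sketch that counts requests for the two retrieval paths discussed in this thread. Everything in it is an in-memory stand-in: `CountingStore`, `search_with_source`, and `search_then_fetch` are hypothetical names, and no real Elasticsearch or Terrastore client is involved.

```python
from dataclasses import dataclass


@dataclass
class CountingStore:
    """In-memory stand-in for a remote service that counts the requests it serves."""
    docs: dict
    calls: int = 0

    def get(self, doc_id):
        self.calls += 1
        return self.docs[doc_id]


def search_with_source(index: CountingStore, ids):
    """One request: the index returns the hits together with their stored source."""
    index.calls += 1
    return [index.docs[i] for i in ids]


def search_then_fetch(index: CountingStore, store: CountingStore, ids):
    """One request for the ids, then one fetch per hit from the external store."""
    index.calls += 1
    return [store.get(i) for i in ids]
```

For N hits the second path costs 1 + N requests instead of 1. The sketch only counts requests; it says nothing about the throughput of two independently deployed systems, which is the other half of the argument.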

Thanks for the great work!
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

kimchy · March 17, 2010, 9:58pm

Not sure that I agree regarding the higher throughput argument, but, in any
case, it's there for people to use :).

-shay.banon

egaumer · March 17, 2010, 10:16pm

On Wed, Mar 17, 2010 at 5:58 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Not sure that I agree regarding the higher throughput argument, but, in any
case, its there for people to use it :).

In all fairness, search (typically) returns references to actual resources
(look at Google). With this in mind, Sergio's argument is a valid one. One
of the common pitfalls of enterprise search projects is that folks want to
use the search index to house complete documents. This has many drawbacks,
especially with regard to volatile data. Of course, the tight integration
between terrastore and elasticsearch invalidates some of the concerns.

With that said, I really like how elasticsearch emulates a (searchable)
key/value store by returning the entire document. I think it expands the
possibilities for different use cases. I've actually used it as a full-fledged
data store for customer information that (until recently) was housed
in a large unwieldy spreadsheet.

The ability to disable this feature means users can decide what makes the
most sense for their particular use case.

Regards,
-Eric

Sergio_Bossa · March 17, 2010, 10:22pm

On Wed, Mar 17, 2010 at 11:16 PM, Eric Gaumer egaumer@gmail.com wrote:

Of course, the tight integration
between terrastore and elasticsearch invalidate some of the concerns.

Thanks for sharing your thoughts, Eric.
Do you mind elaborating more on that?

Thanks again,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

kimchy · March 17, 2010, 10:35pm

Couldn't agree more. In terms of usability, the aim of elasticsearch is to
be the best solution out of the box, and the most configurable one when
needed. It's really up to the users.

As a side note, let me explain why storing the source field might make sense
for certain features. Let's say I want to expose an API that reindexes
an index into a new index. If elasticsearch has the source documents, then
this API can be implemented easily within elasticsearch. If the source is
not there, then elasticsearch can't really provide this API, and the user
would need to "refetch" the data from another data store and index it. The
simplicity of the first solution is something that I really like, but the
user can choose. If source is not enabled, then the API will simply bail.

There are other cases where it would be nice to have the actual content of a
field (and not its analyzed form) without the user having to explicitly
"store" it. But most of them are solved by the user cherry picking which
fields to store.

-shay.banon
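
The reindex idea can be sketched with plain dictionaries standing in for indexes. This is only an illustration of the "bail if there is no source" behaviour described above, not an actual elasticsearch API; `reindex` and `SourceDisabledError` are hypothetical names.

```python
class SourceDisabledError(RuntimeError):
    """Raised when a document has no stored source to reindex from."""


def reindex(old_index: dict, new_index: dict) -> int:
    """Copy every document from old_index into new_index using its stored source.

    Both indexes are plain dicts mapping doc id -> hit. A hit is a dict that
    holds the original JSON under "_source" when source storage is enabled.
    Returns the number of documents copied.
    """
    # Validate first, so that a failure leaves new_index untouched.
    for doc_id, hit in old_index.items():
        if "_source" not in hit:
            raise SourceDisabledError(f"document {doc_id!r} has no stored source")
    for doc_id, hit in old_index.items():
        new_index[doc_id] = {"_source": dict(hit["_source"])}
    return len(old_index)
```

Without the stored source, the caller would have to re-read every document from the original data store and index it again.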

Sergio_Bossa · March 17, 2010, 10:45pm

Shay, what about implementing a pluggable API to look up the
document externally?

Sergio Bossa
Sent by iPhone
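
One possible shape for such a pluggable lookup, sketched in Python. Nothing here is an existing elasticsearch interface: `SourceLookup`, `InMemoryLookup`, and `attach_sources` are hypothetical names, and the in-memory implementation stands in for a client of Terrastore, Cassandra, or any other document store.

```python
from typing import Optional, Protocol


class SourceLookup(Protocol):
    """Anything that can resolve a document id to its full document."""

    def lookup(self, doc_id: str) -> Optional[dict]:
        ...


class InMemoryLookup:
    """Toy implementation backed by a dict; a real one would call a remote store."""

    def __init__(self, docs: dict):
        self._docs = dict(docs)

    def lookup(self, doc_id: str) -> Optional[dict]:
        return self._docs.get(doc_id)


def attach_sources(hit_ids, lookup: SourceLookup) -> list:
    """Turn a list of hit ids into hits carrying externally fetched documents."""
    return [{"_id": i, "_source": lookup.lookup(i)} for i in hit_ids]
```

Keeping the lookup behind a small interface means the search side does not need to know which store holds the documents.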

kimchy · March 17, 2010, 10:51pm

That is certainly possible, just open an issue for that.

-shay.banon

egaumer · March 17, 2010, 11:36pm

On Wed, Mar 17, 2010 at 6:22 PM, Sergio Bossa sergio.bossa@gmail.com wrote:

On Wed, Mar 17, 2010 at 11:16 PM, Eric Gaumer egaumer@gmail.com wrote:

Of course, the tight integration
between terrastore and elasticsearch invalidate some of the concerns.

Thanks for sharing your thoughts, Eric.
Do you mind elaborating more on that?

In terms of enterprise search, roughly 80% of the project time is spent on
document ingest. You've got to aggregate content from disparate sources like
relational databases, content management systems, mail servers, file
servers, file systems, web servers, web services, etc. You're typically
talking about hundreds of millions of documents ranging in all sorts of
formats.

Organizations spend millions of dollars on trying to leverage search to
"unify" their data architecture and it's difficult, expensive, and tends to
lead to fragile one-off solutions that are a nightmare to maintain. To make
matters worse, they want their enterprise development teams to be able to
build applications against the search platform. In doing so they want to
index complete documents to avoid having to make an additional network call
out to the legacy system containing the actual resource.

The problem with this scenario is that data is (typically) quite volatile.
When you rely on getting complete documents straight from a search index,
you end up with tight coupling of the resource. When I do a Google search
for "linux" I might get back a result pointing to kernel.org. If kernel.org
makes changes to the site (i.e., the resource), my result (reference) still
points to the latest version. This is a core principle of REST.

When an enterprise organization insists on building applications against
fully indexed documents (i.e., the source), they suffer from synchronization
problems at the presentation layer. Changes on the original data source are
often not reflected in the application. When they realize this (or you make
them realize it) the most common response is "real time indexing". It's very
difficult to achieve this even when the search engine supports it. Why?
Because you're dealing with large volumes of data that span the globe in
some cases and it's all held together by these fragile ingest architectures.

The end result is lots of unhappy folks, from stakeholders to managers to
engineers to end users.

So to elaborate on my original comment, when you can tightly integrate
search as a layer of the data storage "stack", you get this relatively
seamless synchronization between the resource and the references in the
index. When a user updates a document, the storage system ensures the index
is also updated to reflect the changes. From what I've read, this is exactly
the relationship between terrastore and elasticsearch.
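
A toy sketch of that write-through relationship: every write to the store also updates the index, so search results never point at stale content. This is not how Terrastore and elasticsearch are actually wired together; `IndexedStore` is a hypothetical in-memory class with a naive whitespace tokenizer.

```python
class IndexedStore:
    """Document store that updates a toy inverted index on every write."""

    def __init__(self):
        self.docs = {}   # doc id -> text
        self.index = {}  # term -> set of doc ids

    def put(self, doc_id: str, text: str) -> None:
        # Drop the terms of any previous version before indexing the new one.
        self.delete(doc_id)
        self.docs[doc_id] = text
        for term in set(text.lower().split()):
            self.index.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id: str) -> None:
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        for term in set(old.lower().split()):
            ids = self.index[term]
            ids.discard(doc_id)
            if not ids:
                del self.index[term]

    def search(self, term: str) -> list:
        """Return the ids of documents containing the term, sorted."""
        return sorted(self.index.get(term.lower(), set()))
```

Because the index holds only references (ids), a reader always fetches the current version of the document from the store.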

I've built search architectures for Comcast, IBM, Disney, Financial Times,
Dow Jones, S&P, Associated Press, Thomson/Reuters, and Citi-Group, just to
name a few. This type of integration addresses a huge need and that's what
really interests me most about elasticsearch (the schema free nature and the
elasticity).

The only problem (and this has nothing to do with elasticsearch) is that
these legacy systems aren't going away anytime soon. We'll be dealing with
poorly implemented enterprise data architectures for years to come. The
bright side is that new startups can be built around these new ideas and
pave the way for more intelligent data architectures.

Regards,
-Eric

egaumer · March 17, 2010, 11:48pm

On Wed, Mar 17, 2010 at 6:35 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Couldn't agree more. In terms of usability, the aim of elasticsearch is to
be the best solution out of the box, and the most configurable one when
needed. Its really up to the users.

As a side note, let me explain why storing the source field might make
sense in certain features. Lets say I want to expose an API that allows to
reindex an index into a new index. If elasticsearch has the source
documents, then this API can be implemented easily within elasticsearch. If
the source is not there, then elasticsearch can't really provide this API,
and the user would need to "refetch" the data from another data store, and
index it. The simplicity of the first solution is something that I really
like, but the user can choose. If source is not enabled, then the API will
simply bail.

There are other cases where it would be nice to have the actual content of
a field (and not its analyzed form) without the user having to explicitly
"store" it. But most of them are solved by the user cherry picking which
fields to store.

I completely agree. I think there are valid use cases on both sides. Search
is so pervasive that there is no way to comprehend all possible uses. I
think elasticsearch is one of the most flexible solutions I've come across.
Yes there are missing features but the important thing is it's built on an
intelligent core. Features will eventually be implemented, it's just a
matter of time and community.

I'm guessing that the term "elastic" in elasticsearch is meant to symbolize
the distributed nature of the system. I think it also symbolizes the
flexibility of the system in terms of configuration and overall use. I mean
honestly, I've indexed all sorts of content with elasticsearch and I've
never had to edit/open a configuration file. That's impressive considering
I've had to design some pretty elaborate index schemas in the past, using
other products.

Regards,
-Eric

egaumer · March 18, 2010, 2:55am

On Wed, Mar 17, 2010 at 4:36 PM, Shay Banon shay.banon@elasticsearch.com wrote:

No its not at all. I think what you ask for is mostly covered by
highlighting (and, when searching, you can pass an empty array of fields, in
such a case, the source field would not be returned). With highlighting, you
will be able to get interesting fragments of what you searched for (but, you
would still need to store something to be able to highlight it...).

After a long debate, Lucene and Solr are officially merging. What makes this
interesting for ES is that we'll see some of the Solr features (faceting,
highlighting, etc.) become part of Lucene itself. This should ease the
burden of getting things like highlighting into elasticsearch.

Regards,
-Eric

Lukas_Vlcek1 · March 18, 2010, 7:42am

Well... I am not that familiar with Solr guts but my fear is that some of
its functionality implementations will not fit directly with ES architecture
path. We'll see. But anyway, it is good that Lucene-Solr developers are
joining forces.

kimchy · March 18, 2010, 8:16am

Actually, in both cases, they can be implemented by elasticsearch without
Solr. Highlighting is slowly taking form as we speak, and you already have
query facets in elasticsearch.

As for the merger, I have mixed feelings about it. If they do hold to their
promise, and keep a lucene "core" and lucene "modules" separated from Solr,
then it will be good. I think I know why the merge is happening, and sadly
it probably has nothing to do with pure software...

-shay.banon

Sergio_Bossa · March 18, 2010, 9:46am

On Wed, Mar 17, 2010 at 11:51 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

That is certainly possible, just open an issue for that.

Done: Issues · elastic/elasticsearch · GitHub

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Sergio_Bossa · March 18, 2010, 9:52am

Really great thoughts, and I couldn't agree more: they deserve a whole
(blog) post of their own. If you decide to write one, do not
hesitate to let us know.

Talking about Terrastore/Elasticsearch integration, the answer is yes,
it aims to provide an integrated store/search experience: it's in
early stages, but basic features are there.

Thanks again for sharing,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Clinton_Gormley · March 18, 2010, 2:20pm

So to elaborate on my original comment, when you can tightly integrate
search as a layer of the data storage "stack", you get this relatively
seamless synchronization between the resource and the references in
the index. When a user updates a document, the storage system ensures
the index is also updated to reflect the changes. From what I've read,
this is exactly the relationship between terrastore and elasticsearch.

Really interesting point!

clint

Sergio_Bossa · March 18, 2010, 2:24pm

On Thu, Mar 18, 2010 at 3:20 PM, Clinton Gormley
clinton@iannounce.co.uk wrote:

Really interesting point!

Yes, it is ... so you may want to provide a nice perl API for
Terrastore as well ... okay, that was shameless, please forgive me

--
Sergio Bossa
http://www.linkedin.com/in/sergiob
