Hi Atharva,
you are right, streamability of data is a concern. ES nodes transport all
the data over the wire and provide an API for that.
But consider what happens when a Lucene field is indexed from a list of
terms. In Lucene 3, you have to add all the terms and fields to the
document before the document is indexed. There is no advantage to be
gained by streaming field data; it is just another style of providing
terms to the Lucene analysis.
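Roughly like this in Lucene 3 (just a sketch, field names and text are
made up): even a Reader-based field only "streams" its terms through
analysis, and the document is indexed as a whole when addDocument() is
called.

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class Lucene3Sketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new IndexWriterConfig(Version.LUCENE_36,
                            new StandardAnalyzer(Version.LUCENE_36)));
            Document doc = new Document();
            // the Reader-based field feeds terms into analysis as a stream,
            // but it is still just another way of supplying them
            doc.add(new Field("content", new StringReader("a very long text ...")));
            doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
            // all fields must be on the document before this call;
            // indexing happens here, for the document as a whole
            writer.addDocument(doc);
            writer.close();
        }
    }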
In Lucene 4, field data is processed by pluggable codecs. With a custom
codec that can consume large term streams efficiently, it could be
possible to benefit from streaming large term streams into a single
Lucene field. But for creating an inverted index, all the document data,
with all the fields, must be held in memory at once to generate the
term/field/doc statistics for relevance scoring. That is the price you
pay for inverted indexing.
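Just to show where such a codec would plug in, a minimal Lucene 4 sketch.
I set the standard Lucene40 codec via Codec.forName() here; a custom
term-stream-friendly codec would be registered the same way.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class CodecSketch {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                    new StandardAnalyzer(Version.LUCENE_40));
            // the default codec; a custom codec that consumes large
            // term streams efficiently would be set the same way
            config.setCodec(Codec.forName("Lucene40"));
            IndexWriter writer = new IndexWriter(new RAMDirectory(), config);
            writer.close();
        }
    }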
Or maybe your scenario is different: you do not want term indexing, but
simply want to store key/value-like data where the value is an opaque
binary stream. I think there are also enhancements in Lucene 4 for this
key/value style.
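For example (again just a sketch, field names are made up), Lucene 4 can
store an opaque byte[] alongside an indexed key, without analyzing or
inverting the value:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;

    public class KeyValueSketch {
        public static void main(String[] args) {
            byte[] value = new byte[]{1, 2, 3}; // some opaque binary payload
            Document doc = new Document();
            // the key is indexed for lookup, the value is only stored
            doc.add(new StringField("key", "user-42", Field.Store.YES));
            doc.add(new StoredField("value", value));
        }
    }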
Another related issue is JSON combined with binary stream data. ES
encodes binary streams as base64 and offers compression to reduce the
overhead.
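With the Java API it looks roughly like this (index/type/field names are
made up, and client is assumed to be an existing Client instance); the
byte[] is rendered as a base64 string in the JSON:

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.xcontent.XContentBuilder;
    import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

    public class Base64Sketch {
        public static void index(Client client, byte[] payload) throws Exception {
            XContentBuilder builder = jsonBuilder()
                    .startObject()
                    .field("key", "user-42")
                    // a byte[] field is base64-encoded when rendered as JSON;
                    // map it as type "binary" in the mapping
                    .field("data", payload)
                    .endObject();
            client.prepareIndex("myindex", "mytype", "1")
                    .setSource(builder)
                    .execute().actionGet();
        }
    }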
We will wait and see what improvements ES will offer in 0.21 and later!
Jörg
On Friday, November 9, 2012 10:03:54 AM UTC+1, Atharva Patel wrote:
Can the Text, XContent, BytesReference classes of Elasticsearch be used
here to achieve this?
On Friday, 9 November 2012 13:52:35 UTC+5:30, Atharva Patel wrote:
I have a use case where I need to append a certain amount of text to an
existing field of a document that was indexed earlier. After appending, I
will reindex the document. I expect the amount of data stored in that
single field of my document to be pretty large, so I decided not to store
the field and also disabled the _source feature on that document type.
I also feel that, since several indexing/reindexing operations will be
going on simultaneously in the JVM (I am currently using the Java API),
it will be highly memory-inefficient to bring the whole field value into
memory as a large string object.
I am wondering if the 'streamability' of the field in a document in the
Java API can be used in some way to make my use case memory-efficient. If
yes, I would like to see some pseudocode or an example of the stream
operations to perform to achieve this use case memory-efficiently.