Using the Streamable field value in Java API

I have a use case where I need to append a certain amount of text to an
existing field of a document that has already been indexed. After
appending, I will reindex the document. I expect the amount of data stored
in that single field to become quite large, so I decided not to store the
field and also disabled the _source feature on that document type.

I also feel that, since several indexing/reindexing operations will be
running simultaneously in the JVM (I am currently using the Java API), it
would be highly memory-inefficient to bring the whole field value into
memory as one large String object.

I am wondering if the 'streamability' of a document field in the Java API
can be used in some way to make my use case memory-efficient. If yes, I
would like to see some pseudocode or an example of the stream operations
to perform so that this use case is handled memory-efficiently.

--

Can the Text, XContent, and BytesReference classes of Elasticsearch be used
here to achieve this?

--

Hi Atharva,

You are right, streamability of data is a concern. ES nodes transport all
data over the wire and provide an API for that.
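
For reference, this is roughly the shape of the wire-serialization
contract in the ES Java API (org.elasticsearch.common.io.stream); note
that it reads and writes whole values, so it does not by itself give you a
lazy view of a large field:

    import java.io.IOException;

    import org.elasticsearch.common.io.stream.StreamInput;
    import org.elasticsearch.common.io.stream.StreamOutput;

    // The Streamable contract used for node-to-node transport: a value is
    // written to and read from the wire in full, so implementing it does
    // not avoid materializing a large field value in memory.
    public interface Streamable {
        void readFrom(StreamInput in) throws IOException;
        void writeTo(StreamOutput out) throws IOException;
    }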

But consider what happens when a Lucene field is indexed from a list of
terms. In Lucene 3, you have to add all the terms and fields to the
document before the document is indexed. It is not possible to gain an
advantage by streaming field data; it is just another style of providing
terms to the Lucene analysis.
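
To illustrate with a minimal Lucene 3.6 sketch (the paths and field name
are made up): even a Reader-based field is just another way of supplying
terms, and the document is still inverted in one step:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class Lucene3StreamSketch {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                    new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/idx")), config);

            Document doc = new Document();
            // A Reader-based field streams characters into the analyzer, but
            // this is just another way of providing terms: the document is
            // still inverted in one step inside addDocument().
            doc.add(new Field("content", new FileReader("/tmp/large-content.txt")));
            writer.addDocument(doc);
            writer.close();
        }
    }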

In Lucene 4, field data is processed by pluggable codecs. With a custom
codec that can consume large term streams efficiently, it could be
possible to benefit from streaming large term streams into a single Lucene
field. But to build an inverted index, all the document data with all its
fields must be referenced in memory at once in order to generate the
term/field/doc statistics for relevance scoring. That is the price you pay
for inverted indexing.

Or maybe your scenario is different: you do not want term indexing, but
simply key/value-style storage where the value is an opaque binary stream.
I think Lucene 4 also offers enhancements for that key/value style.
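
For example, a minimal Lucene 4 sketch of that key/value style (the field
name is made up): the value is stored as opaque bytes and never analyzed
or inverted:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.index.IndexWriter;

    public class KeyValueSketch {
        // writer is assumed to be an already open Lucene 4 IndexWriter
        static void storeBlob(IndexWriter writer, byte[] blob) throws Exception {
            Document doc = new Document();
            // Stored as-is: no terms, no postings, just retrievable bytes.
            doc.add(new StoredField("payload", blob));
            writer.addDocument(doc);
        }
    }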

Another related issue is JSON combined with binary stream data. ES encodes
binary streams as Base64, and offers compression to mitigate the overhead.
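
For example (a sketch against the 0.20 Java API; the index, type, and
field names are made up), a byte[] handed to XContentBuilder is serialized
as a Base64 string in the JSON source:

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.xcontent.XContentBuilder;
    import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

    public class Base64Sketch {
        static void indexBlob(Client client, byte[] blob) throws Exception {
            // The byte[] ends up as a Base64 string inside the JSON
            // document, adding roughly one third of overhead before
            // compression.
            XContentBuilder builder = jsonBuilder()
                    .startObject()
                    .field("data", blob)
                    .endObject();
            client.prepareIndex("myindex", "mytype", "1")
                  .setSource(builder)
                  .execute().actionGet();
        }
    }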

We will wait and see what improvements ES will offer in 0.21 and later!

Jörg

--

The ability to stream data into Lucene is fairly unexplored, I think, but
in Lucene 4 there is probably enough flexibility for it now. The notions
of Document and Field are heavily abstracted, and the only requirement for
indexing content is being able to generate a TokenStream. I hope it's
something we can look into more in future ES versions.
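
For example, a minimal Lucene 4 sketch of that idea (the path and field
name are made up): the field is fed a TokenStream produced incrementally
from a Reader, so the raw text never has to exist as one String:

    import java.io.FileReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.util.Version;

    public class TokenStreamSketch {
        static void indexLargeText(IndexWriter writer) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            // Tokens are pulled from the Reader on demand during indexing;
            // the only requirement on the content is that it yields a
            // TokenStream.
            TokenStream tokens = analyzer.tokenStream("content",
                    new FileReader("/tmp/huge.txt"));
            Document doc = new Document();
            doc.add(new TextField("content", tokens));
            writer.addDocument(doc);
        }
    }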

--

Thanks Jörg for the detailed response. After your explanation of Lucene
3's weakness in dealing with streams, I wondered whether there is a limit
on the maximum field value length in bytes. Searching around, I found that
there is an option for setting the maximum field length on IndexWriter:
IndexWriter.MaxFieldLength
(http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/api/all/org/apache/lucene/index/IndexWriter.MaxFieldLength.html).
It can be set to UNLIMITED, which is bounded only by Integer.MAX_VALUE. So
is there a way to specify this in ES, or is the default itself unlimited
in ES?

Does Lucene 3's non-streamability also imply that it may raise an
OutOfMemoryError when a large stream is passed as an ES field value?

--

You should increase the HTTP request upload limit, which is set to 100 MB
by default in ES.
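
For illustration, a sketch against the 0.20 Java API, assuming an embedded
node; the 500mb value is just an example, and the same http.max_content_length
key also works in elasticsearch.yml:

    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.node.Node;
    import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

    public class NodeSettingsSketch {
        public static void main(String[] args) {
            // Raise the HTTP request size limit (default 100mb) for this node.
            Node node = nodeBuilder()
                    .settings(ImmutableSettings.settingsBuilder()
                            .put("http.max_content_length", "500mb"))
                    .node();
        }
    }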

ES does not set a field length limit in Lucene, so I think it is
Integer.MAX_VALUE in IndexWriter, which is deprecated anyway. As Chris
Male stated, the new limit is only a restriction on the number of tokens,
and it is optional. For example, it can be set with the
LimitTokenCountAnalyzer:
http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/analysis/LimitTokenCountAnalyzer.html
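
A minimal Lucene 3.6 sketch; the limit of 10,000 tokens is just an
example:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class LimitSketch {
        // Wraps any analyzer and stops emitting tokens after the given
        // count, replacing the deprecated IndexWriter.MaxFieldLength
        // mechanism.
        static final Analyzer ANALYZER = new LimitTokenCountAnalyzer(
                new StandardAnalyzer(Version.LUCENE_36), 10000);
    }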

Of course, you will run into OOMs if the heap size of the nodes can't
handle the amount of data. You need to experiment. Heap limits can be
changed in bin/elasticsearch.in.sh.

Cheers,

Jörg

--