[Java] Stream large file while indexing

Hi,

While indexing, I have a field whose value is quite large.
This value is stored in a file as text.

I would prefer not to load the entire file into main memory to read the content.
Instead, I would like the file to be streamed directly to ES without
using much RAM.

Is this possible with the Java API of ES?

Thanks
Vineeth

I don't believe this is possible. It would need some sort of JSON
streaming support.

How big are you talking about? If you're not storing it as its own stored
field or as part of the _source JSON, it should probably be OK on the search side.

Thanks,
Paul

It's not on the search side.
Mine is a multi-threaded application, and it is trying to index tons of
files at a time into ES.
So I am trying to use as little main memory as possible.

Thanks
Vineeth
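
If, as suggested above, each document has to be built fully in memory, one way a multi-threaded indexer might keep RAM use predictable is to cap how many files are loaded at the same time. The following is only a sketch under that assumption: `BoundedFileLoader`, `loadAndIndex` and the `indexer` callback are hypothetical names, not part of the ES Java API, and the callback stands in for whatever indexing call the application already makes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

/** Hypothetical helper: limits how many file contents are held in memory at once. */
final class BoundedFileLoader {
    private final Semaphore permits;

    BoundedFileLoader(int maxFilesInMemory) {
        this.permits = new Semaphore(maxFilesInMemory);
    }

    /**
     * Blocks until a permit is free, loads the file (assumed to be UTF-8 text),
     * hands the content to the caller-supplied indexer, then releases the permit.
     */
    void loadAndIndex(Path file, Consumer<String> indexer)
            throws IOException, InterruptedException {
        permits.acquire();
        try {
            String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            indexer.accept(content); // placeholder for the real indexing call
        } finally {
            permits.release();
        }
    }
}
```

Any number of worker threads can share one `BoundedFileLoader`; at most `maxFilesInMemory` file contents exist at once, so the worst-case client-side memory is roughly that number times the largest file.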

There is no support for streaming.

Hello Shay,

Can you give any clues or hints on how to implement this on top of the existing API?
I was hoping to accomplish this by overriding a couple of methods.

Any pointers in this direction would be appreciated.

Thanks
Vineeth

Not really, there is no support for streaming the data of a single large doc.

Hi Shay,

Has anything changed since then? Maybe some third-party plugins exist that
allow streaming large files to Elasticsearch?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

There are several different aspects when it comes to streaming files to
Elasticsearch.

  1. The first is that the input format is JSON, and the indexer has to create
    large JSON docs, but only for a tiny display later (highlighting).

  2. The second is that you want binary files stored inside Lucene unindexed
    (for whatever reason). The main challenge is how to handle this with
    JSON (it requires something like base64 encoding in combination with
    compression).

  3. And the third is that you want Elasticsearch to do smart Lucene index codec
    processing to create documents from streams "on the fly".
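
To illustrate the base64-plus-compression idea from point 2, here is a minimal sketch using only the JDK. It compresses and encodes a file through a small fixed buffer, so the client never holds the whole file in memory while encoding. `Base64GzipStreamer` and `encode` are hypothetical names, and this only covers the client-side encoding step: as said earlier in this thread, Elasticsearch itself has no support for streaming a single document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: file -> gzip -> base64 -> output stream, 8 KB at a time. */
final class Base64GzipStreamer {

    /**
     * Writes the gzip-compressed, base64-encoded content of {@code file} to {@code out}.
     * Note: closing the base64 wrapper flushes the final padding and also closes {@code out}.
     */
    static void encode(Path file, OutputStream out) throws IOException {
        try (InputStream in = Files.newInputStream(file);
             OutputStream base64 = Base64.getEncoder().wrap(out);
             GZIPOutputStream gzip = new GZIPOutputStream(base64)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                gzip.write(buffer, 0, read);
            }
        }
    }
}
```

The output contains only base64 characters, so it can be placed inside a JSON string value without further escaping. Whether the compression pays off depends on the content, since base64 alone grows the data by about a third.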

-> 1. I think there is no advantage in streaming for this case, since
Lucene needs the whole document in memory somewhere to compute the inverted
index statistics. If your input documents are large, just
add enough heap memory to get them processed. I'm not sure, but most
search engines out there, including Google, have a hard limit on how much of
a document is analyzed for highlighting or term indexing (the first 10k
characters maybe?). The reason is better performance. So maybe there is
little sense in forcing these features onto Elasticsearch with large docs.
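
If a hard limit like that is acceptable for your use case, the client can apply it before indexing. A minimal sketch, assuming the file is UTF-8 text; `PrefixReader`, `readPrefix` and the limit you pass in are hypothetical, and the 10k figure above is only a guess, not a documented Elasticsearch setting:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: reads at most the first maxChars characters of a text file. */
final class PrefixReader {

    /**
     * Returns the first {@code maxChars} chars of the file, or the whole file if it is
     * shorter. Memory use is bounded by maxChars regardless of the file size.
     * A surrogate pair may be cut in half at the boundary; this sketch ignores that.
     */
    static String readPrefix(Path file, int maxChars) throws IOException {
        char[] buffer = new char[maxChars];
        int filled = 0;
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (filled < maxChars) {
                int read = reader.read(buffer, filled, maxChars - filled);
                if (read == -1) {
                    break; // end of file reached before the limit
                }
                filled += read;
            }
        }
        return new String(buffer, 0, filled);
    }
}
```

For example, `PrefixReader.readPrefix(path, 10_000)` would give a value for the large field that never exceeds 10,000 characters, at the cost of the rest of the file not being searchable.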

-> 2. You should think twice about this; it may be better to store
these files outside Elasticsearch. Retrieving large stored docs may hurt
your performance, and relocating shards will also be slow.
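
One way to read "store these files outside Elasticsearch" is to index only a small reference document that points at the file. The sketch below builds such a reference with the JDK alone; the class name and the field names `path`, `size_bytes` and `sha256` are made up for illustration. The checksum is computed by streaming through an 8 KB buffer, so the file is never fully loaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: describes an externally stored file as a small map. */
final class ExternalFileReference {

    static Map<String, Object> describe(Path file)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            while (in.read(buffer) != -1) {
                // Reading through DigestInputStream updates the digest as a side effect.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        // Field names are illustrative assumptions, not an Elasticsearch convention.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("path", file.toAbsolutePath().toString());
        doc.put("size_bytes", Files.size(file));
        doc.put("sha256", hex.toString());
        return doc;
    }
}
```

The resulting map is tiny no matter how large the file is, so it could be serialized to JSON and indexed in place of the content; the application then fetches the file from its own storage when needed.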

-> 3. You can implement a custom codec for Lucene 4 that may handle
your content streams gracefully. Unfortunately, such a codec is
domain-specific, since it depends on the nature of the stream elements.
For example, such a codec could do complex event stream processing (CEP)
like in Esper http://esper.codehaus.org/

My 2c.

Jörg

On 06.05.13 21:34, Yermakovich Siarhei wrote:

Has anything changed since? Maybe some third-party plugins exist, that
allow to do large files streaming to Elasticsearch?

Hi Vineeth,

I am running into the same issue here (I am using the .NET REST API client, NEST).
Have you found a solution for indexing large files? Thank you.

Best Regards
Hao

I believe Elasticsearch loads the whole indexed document into RAM before
indexing. It certainly loads the whole document into RAM for things like
source filtering. Lucene doesn't require this, but Elasticsearch does it
because for the typical use case it's fine.