Does the server support streaming?


(Ryan Pedela) #1

Let's say I have a million documents I want to index. I am aware that you
can index 100 documents at a time or 1000 at a time using the bulk API.
However I could also write my HTTP client to stream all one million
documents as bytes with a single bulk API call. This would be advantageous
because the number of round-trips would be reduced which would reduce HTTP
overhead. In addition, I will have predictable memory usage on the client.
In a batch, there could be documents that are a few KB, a few MB, or a few
GB. This means it can be hard to predict the memory requirements of the
client. With byte streaming, the memory requirements are whatever I set the
buffer to.
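
To illustrate the memory concern above, here is a minimal sketch of batching by byte size instead of by document count, so the client holds roughly a fixed amount of data per request. The function name `batches_by_bytes` and the `max_bytes` parameter are made up for this example and are not part of any ES client.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batches_by_bytes(docs: Iterable[Dict], max_bytes: int) -> Iterator[List[bytes]]:
    """Group serialized documents so each batch stays within max_bytes.

    A single document larger than max_bytes still gets a batch of its own,
    so the bound only holds when no individual document exceeds the budget.
    """
    batch: List[bytes] = []
    size = 0
    for doc in docs:
        line = json.dumps(doc).encode("utf-8")
        # Flush the current batch before this document would exceed the budget.
        if batch and size + len(line) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield batch
```

Because the input is consumed lazily, only one batch is held in memory at a time; very large single documents remain the case this approach cannot bound.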

My question: does the ES server support streaming a large API call? Or does
it store the entire API call in memory and then process it? I have the same
question for search results but in reverse.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6e494a74-3ffb-4afb-88af-61f2fb799eee%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

You are correct, ES nodes consume data request by request before it is
passed on through the cluster. This also applies to bulk indexing requests:
they are temporarily pushed to buffers, then split by lines and
executed as single actions.
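
For readers unfamiliar with why a bulk request can be split by lines: the body is newline-delimited JSON, with one action line followed by one source line per document. A minimal sketch of building such a body follows; the index name and the sequential ids are placeholders chosen for this example.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build a newline-delimited bulk body: an action line, then the document."""
    lines = []
    for doc_id, doc in enumerate(docs):
        # Action/metadata line; the sequential id is only for illustration.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"
```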

So to reduce network round trips, the best thing is to use the bulk API.
What is left to optimize is a few percent, which is hardly worth it. With
gzip, ES HTTP provides transparent compression. The main challenges are HTTP
overhead (headers can't be compressed) and base64 encoding, if you use binary
data with ES.

Please note that you must evaluate the bulk responses too, in order to
verify that the bulk request succeeded at the document level.
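
As a rough sketch of what checking at the document level can look like: a bulk response carries an `items` list with one entry per action, each with its own status. The helper name `failed_items` and the sample response in the usage note are invented for illustration; real responses contain more fields.

```python
from typing import Dict, List, Tuple


def failed_items(response: Dict) -> List[Tuple[object, int, object]]:
    """Return (id, status, error) for every action that did not succeed."""
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        for _action, result in item.items():
            status = result.get("status", 0)
            if status >= 300:
                failures.append((result.get("_id"), status, result.get("error")))
    return failures
```

For example, given a hand-written response such as `{"items": [{"index": {"_id": "1", "status": 201}}, {"index": {"_id": "2", "status": 400, "error": "parse failure"}}]}`, the function returns only the entry for document `2`.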

It is also possible to extend the whole ES API to WebSocket, so that besides
JSON over HTTP, you could transfer JSON text frames or
SMILE/binary frames on a single bi-directional channel. HTTP must use two
channels for this, so with WebSocket you can cut connection resources
in half. In this sense, the Netty channel / REST / Java API could be
extended for special realtime WS streaming applications, such as
pubsub applications. I experimented with that some time ago on ES 0.20:
https://github.com/jprante/elasticsearch-transport-websocket (needs
updating)

From what I understand, the thrift transport plugin compiles the ES API,
operates in a streaming-like fashion, and provides a solution that
reduces HTTP overhead:
https://github.com/elasticsearch/elasticsearch-transport-thrift

Jörg



(eunever32) #3

Hi,

I have a similar question to the OP's: what is the best way to get 1m or 30m
records indexed?
I mean, I can send client.bulk batches of records, but while the request is
being indexed the client is waiting: valuable seconds.

I have also tried the Python client, elasticsearch-py, which has helpers.bulk
and helpers.streaming_bulk.
Looking at the source code, I can see that helpers.bulk calls
helpers.streaming_bulk, so is it the same thing? That is,
should I continue to call helpers.bulk? Or what is the difference?

Thanks,



(Honza Král) #4

Hi,

the streaming_bulk function in elasticsearch-py is a helper that will
actually split the stream of documents into chunks and send them to
elasticsearch - it does not stream all documents to ES as a single
request. It is impossible (due to the nature of bulk requests) for
elasticsearch to consume an arbitrary number of documents in a single
request, so this helper was created to give you the abstraction.

The difference between bulk and streaming_bulk is in the way it's
executed and returned - bulk will just return statistics/errors while
streaming_bulk is a generator that will keep yielding results per
document, thus completely hiding the fact that the stream is being
sent to ES in chunks.
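
The relationship described above can be modelled in a few lines. This is a simplified sketch and not the library's actual code: `streaming_bulk_model`, `bulk_model`, and `make_fake_sender` are invented names, and `send_chunk` stands in for whatever sends one bulk request and returns an `(ok, item)` pair per document.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Result = Tuple[bool, Dict]
SendChunk = Callable[[List[Dict]], List[Result]]


def streaming_bulk_model(send_chunk: SendChunk, actions: Iterable[Dict],
                         chunk_size: int = 500) -> Iterator[Result]:
    """Generator: send the stream in chunks, yield one result per document."""
    chunk: List[Dict] = []
    for action in actions:
        chunk.append(action)
        if len(chunk) == chunk_size:
            yield from send_chunk(chunk)
            chunk = []
    if chunk:
        yield from send_chunk(chunk)


def bulk_model(send_chunk: SendChunk, actions: Iterable[Dict],
               chunk_size: int = 500) -> Tuple[int, List[Dict]]:
    """Consume the generator fully and return only (success count, errors)."""
    success, errors = 0, []
    for ok, item in streaming_bulk_model(send_chunk, actions, chunk_size):
        if ok:
            success += 1
        else:
            errors.append(item)
    return success, errors


def make_fake_sender(sent_sizes: List[int]) -> SendChunk:
    """Stand-in for a real bulk request; marks documents with 'bad' as failed."""
    def send_chunk(chunk: List[Dict]) -> List[Result]:
        sent_sizes.append(len(chunk))
        return [(not doc.get("bad", False), doc) for doc in chunk]
    return send_chunk
```

In this model both functions send exactly the same chunks; the only difference is whether the caller sees results as they arrive or a summary at the end.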

Hope this helps,
Honza



(eunever32) #5

Hey thanks,

So is there a convenient way to asynchronously call the bulk (helpers bulk
or helpers streaming_bulk) in a way that means the client isn't waiting for
the request to complete?



(Honza Král) #6

Well, you can just use any async HTTP library to do it, but I wouldn't
recommend it, since putting it all back together might be difficult (to
see which documents failed to index, etc.). You can always just have a
couple of threads each running a streaming bulk, reading from a Queue
and writing the results to another Queue; that should be fairly easy to do
in your code.
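
A minimal sketch of that thread-and-queue layout, using only the standard library. The names `worker`, `run_workers`, and `fake_index_batch` are invented for this example; `index_batch` is a placeholder for a function that runs a streaming bulk over one batch and returns `(ok, item)` pairs.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

SENTINEL = None  # tells a worker that no more batches are coming


def worker(index_batch: Callable, work_q: queue.Queue, result_q: queue.Queue) -> None:
    """Read batches from work_q, index them, write per-document results to result_q."""
    while True:
        batch = work_q.get()
        if batch is SENTINEL:
            break
        try:
            results = index_batch(batch)
        except Exception as exc:
            # Report the whole batch as failed instead of killing the thread.
            results = [(False, {"error": str(exc), "doc": doc}) for doc in batch]
        for result in results:
            result_q.put(result)


def run_workers(index_batch: Callable, batches: Iterable[List],
                num_threads: int = 2) -> List[Tuple]:
    work_q: queue.Queue = queue.Queue(maxsize=num_threads * 2)  # bounds client memory
    result_q: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(index_batch, work_q, result_q))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for batch in batches:
        work_q.put(batch)
    for _ in threads:
        work_q.put(SENTINEL)
    for t in threads:
        t.join()
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return results


def fake_index_batch(batch: List) -> List[Tuple]:
    """Stand-in for a real streaming bulk call; reports every document as indexed."""
    return [(True, doc) for doc in batch]
```

Results arrive in no particular order, so each result should carry enough information (such as the document id) to be matched back to its input.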



(eunever32) #7

So excuse me, Honza: am I correct in thinking there is no point, from a
performance perspective, in calling helpers.bulk, because it will just be
"sliced" into the chunk size by streaming_bulk anyway?

Would it make more sense to call helpers.streaming_bulk directly to reduce
the client-side activity?

And actually, from a raw performance perspective, just call client.bulk?

Thanks



(Honza Král) #8

The difference between streaming_bulk and regular bulk is just in the
API; under the hood they both perform the same operation. The only
difference is that bulk only returns once all documents have been sent,
whereas streaming_bulk is a generator that keeps yielding individual
results.



(Randall McRee) #9

HTTP overhead is minuscule compared to the server-side (elasticsearch)
resources required to index the documents. Even with bulk and no
streaming, the bottleneck is in building the index, in particular
disk I/O (primarily) as well as CPU and memory.

So, regardless, your client will always end up waiting for the server to
finish these necessary tasks since they require so much more time and space
than simply sending the documents across the network.



(Otis Gospodnetić) #10

Hi,

The key is to find the ideal bulk size and the ideal bulk request
concurrency level, and then make sure the client always feeds ES enough
data to achieve (close to) ideal utilization and minimize idling on either
side.
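
One way to search for those two values is a small sweep that measures throughput for each combination. The sketch below is only a harness under stated assumptions: `measure_throughput`, `sweep`, and `fake_send_batch` are invented names, and `send_batch` is a placeholder for a function that sends one bulk request. Real measurements need a real cluster and representative documents.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple


def measure_throughput(send_batch: Callable, docs: Sequence,
                       bulk_size: int, concurrency: int) -> float:
    """Return documents per second for one bulk size and concurrency level."""
    batches = [docs[i:i + bulk_size] for i in range(0, len(docs), bulk_size)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces all requests to finish before the timer stops.
        list(pool.map(send_batch, batches))
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return len(docs) / elapsed


def sweep(send_batch: Callable, docs: Sequence, bulk_sizes: List[int],
          concurrencies: List[int]) -> Dict[Tuple[int, int], float]:
    """Measure every (bulk_size, concurrency) combination."""
    return {(b, c): measure_throughput(send_batch, docs, b, c)
            for b in bulk_sizes for c in concurrencies}


def fake_send_batch(batch: Sequence) -> int:
    """Stand-in for a real bulk request; does no network I/O."""
    return len(batch)
```

The combination with the highest measured throughput is a starting point; it is worth repeating the sweep, since results vary with cluster load.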

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



(system) #11