Pushing bulk data to ES in a stream


(abhijit.singh) #1

Hello everyone,

I wanted to know if it is possible to index docs through a stream that
pushes data to the Elasticsearch cluster.

Our current problem is indexing a huge data set from Postgres into
Elasticsearch while processing the data in between. We have been able to
stream data out of Postgres, which lets our Ruby code run in constant
memory, but there is a significant delay when posting these docs in
batches to ES through the Bulk API.

It would be ideal if there were a mechanism to push our docs continuously
into the ES cluster, reducing the bottleneck currently created by the
bulk call.

Ideally I would also have posted the batches of docs from a different
thread, but that would create memory issues, so I thought streaming would
be a good alternative.

I apologise that this is a somewhat subjective question; please do ask
for the code if you need more detail.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/47377f8b-5825-47b6-9eba-ea2521fbab92%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

Can you describe what the "significant delay" is, relative to your
requirements? And how do you stream data out of PostgreSQL — by
interpreting WAL files?

Note that you can lower the bulk request length / request size so that
indexing works closer to real time, like streaming. Have you tried this?
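To make "smaller bulk requests" concrete, here is a hedged Ruby sketch of slicing docs into NDJSON bulk bodies; the index/type names and the docs' `id` field are illustrative assumptions, not details from this thread:

```ruby
require "json"

# Build NDJSON bodies for the Bulk API from an array of doc hashes.
# batch_size controls how close to "streaming" each request gets:
# smaller batches mean more frequent, smaller requests.
def bulk_bodies(docs, index:, type:, batch_size: 100)
  docs.each_slice(batch_size).map do |batch|
    batch.flat_map { |doc|
      [JSON.generate("index" => { "_index" => index, "_type" => type, "_id" => doc["id"] }),
       JSON.generate(doc)]
    }.join("\n") + "\n" # the Bulk API requires a trailing newline
  end
end
```

Each returned string is one Bulk API request body (action line plus source line per doc), so tuning `batch_size` directly trades request count against request size.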

Jörg



(abhijit.singh) #3

Thanks for the reply. Cursors are what we use for fetching large data
sets from Postgres; the sequel_pg Ruby gem has also added streaming
support for Postgres versions greater than 9.2. The Ruby code doesn't
complain when we deal with limited data.

But fetching a batch of 1000 rows from Postgres and posting 1000 docs to
ES don't take equal time: the response time for bulk posting is obviously
longer. We haven't tried lowering our request length/size; I would guess
that would increase the total time.



(Jörg Prante) #4

How much time does it take for 1000 rows from PostgreSQL?

Maybe your configuration is not suited to correct bulk indexing?
Elasticsearch can easily handle 1000 rows in milliseconds, and it can
index bulk requests concurrently. It's all a matter of how the bulk
indexing is programmed. You could use the Java API for that.

Jörg



(AbhijitPratap Singh) #5

There is an initial cost to obtaining the first batch of 1000 rows from a
result set of, say, 50000 rows; after that it just uses cursors and
fetches 1000 rows per batch in almost no time:

First batch fetch: ~200 ms
Second batch onwards: ~30 ms

We don't have our ES cluster on the same node as Postgres, so I agree
network latency is a factor. But barring the first fetch, the bulk
response time from ES is always much longer than the time to fetch 1000
rows from Postgres.

Concurrent push would be the ideal solution, I agree. ES surely can
handle a lot of docs concurrently, but it increases the memory footprint
of our Ruby code.

I looked at the websocket transport protocol provided by ES. Could it be
of some help in this situation?



(Jörg Prante) #6

What kind of data — which keys and values — are you fetching?

Note that if you have new fields, ES needs time to create the dynamic
mapping. Also, if you index into a new index, ES needs time to create the
index. Another point is the rendezvous: when a client starts, it needs
time to connect to all the nodes of the cluster over the network and
detect a master (if it's a node client). 200 ms would be very fast for
all of this.

All of this can be prepared beforehand, and then you can fetch data from
PostgreSQL, or from any other source.

Websockets have nothing to do with this challenge. Websockets are for a
bidirectional interaction style; they have no impact on the internals of
bulk indexing.

With the right client, you can already bulk index concurrently, at
maximum speed. I admit that, apart from Java/JVM-based implementations
using the transport-layer protocol, I do not know of concurrent bulk
indexing implementations yet.
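For illustration, concurrent bulk posting with bounded memory could be sketched in Ruby like this; `post_bulk` is a hypothetical callable standing in for the actual HTTP bulk request, and the worker/queue sizes are arbitrary:

```ruby
# Post batches concurrently from a fixed pool of worker threads, with a
# bounded queue so the producer blocks (backpressure) instead of
# buffering batches without limit in memory.
def stream_bulk(batches, workers: 4, queue_size: 8, &post_bulk)
  queue = SizedQueue.new(queue_size) # push blocks when the queue is full
  threads = workers.times.map do
    Thread.new do
      while (batch = queue.pop) # nil acts as a stop signal
        post_bulk.call(batch)   # e.g. an HTTP POST to _bulk
      end
    end
  end
  batches.each { |b| queue.push(b) } # producer blocks if workers lag
  workers.times { queue.push(nil) }  # one stop signal per worker
  threads.each(&:join)
end
```

Because `SizedQueue#push` blocks once `queue_size` batches are in flight, memory stays constant even when ES responds slowly — which is the concern raised above.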

Jörg



(abhijit.singh) #7

Thanks a lot. I understand the bottlenecks involved when a client is
initialized. We have a tight schema for our docs: all of them follow a
strict schema that is enforced before indexing starts.
You can imagine a sample doc with 32-odd keys: 20 of them strings with
both analyzed and not_analyzed properties, the rest either boolean or
integer, and a few with "completion" enabled.

Although websockets are not raw sockets, I think socket-based stream
communication might still have helped, in that we could have pushed docs
to the stream continuously and concurrently.

Can you give some examples of clients, in any language, which do async or
concurrent indexing? At least I am not aware of such a mechanism in the
Ruby client.



(Jörg Prante) #8

Websockets are raw TCP/IP sockets. If you compare them with bulk indexing
over HTTP (port 9200), you will find that with HTTP keep-alive you get
similar performance patterns, and with the transport protocol (port 9300)
you can set TCP socket keepalive and port-reuse flags, which ensures a
stream connection. The difference compared to websockets is that ES uses
compression and a different back channel for sending responses, but that
is not very relevant here.

You can check out the org.elasticsearch.action.bulk.BulkProcessor class
for concurrent bulk indexing.

An example Java source code for usage is

https://github.com/jprante/elasticsearch-support/blob/master/src/main/java/org/xbib/elasticsearch/support/client/bulk/BulkTransportClient.java
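For Ruby readers, the BulkProcessor idea — buffer actions and flush whenever a count or byte threshold is crossed — could be approximated as follows; `BulkBuffer` and its `flush` callback are hypothetical names for this sketch, not an actual client API:

```ruby
# Minimal BulkProcessor-style buffer: collects NDJSON action lines and
# flushes one bulk body whenever the buffered action count or byte size
# crosses its threshold.
class BulkBuffer
  def initialize(max_actions: 1000, max_bytes: 1_000_000, &flush)
    @max_actions, @max_bytes, @flush = max_actions, max_bytes, flush
    @lines, @bytes = [], 0
  end

  def add(line)
    @lines << line
    @bytes += line.bytesize
    flush! if @lines.size >= @max_actions || @bytes >= @max_bytes
  end

  def flush!
    return if @lines.empty?
    @flush.call(@lines.join("\n") + "\n") # hand one bulk body to the caller
    @lines.clear
    @bytes = 0
  end
end
```

The caller decides what `flush` does (synchronous HTTP POST, or hand-off to a worker thread), which mirrors how BulkProcessor separates buffering from request execution.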

Jörg



(abhijit.singh) #9

Thanks a lot for the response.



(system) #10