Indexing multiple things at once. Possible?


(elasticsearcher) #1

I've searched around on the docs, and I haven't found a solution, so I thought I'd ask here.

In my program, I generate many short documents to index very quickly (shall we say, 1000 every few seconds, per thread, and I have many threads on many nodes), and then insert them into ElasticSearch for indexing one-by-one until they're gone. I believe this may be a bottleneck in my system.

Is there any way to index a large batch of documents at once (all of the same type)?

I am currently using the REST API via Python, but if this feature exists in a different API, I could conceivably incorporate it into my program.

My document type looks like:

{
Name1:
Name2:
Percent:
}

I'm imagining the slowdown is simply because I have to push thousands of documents to the cloud, one-by-one, even though I have large chunks of them generated at once, and the overhead of individual transfers/indexing is the bottleneck.


(Shay Banon) #2

It's important to understand where the bottleneck is. When you say you index
documents "into" the cloud, what do you mean? Is that a WAN call?

In reply to elasticsearcher's message of Tue, Aug 24, 2010, 10:02 PM.


(Mahendra M) #3

Hi,

I had a similar requirement. I don't know whether this solution will work
for you, but you can try an alternative approach.

Instead of indexing the documents directly, put them on a message
queue (like RabbitMQ).

Have consumers that keep reading from the queue and index each
document into ElasticSearch.

This way, by de-coupling your document generation and document
indexing, you need not worry about the rate at which your documents
are being created.

Also, since your documents seem to be small, this will not be much of
an overhead on messaging systems.

If you use a framework like Celery, this is done very transparently
for you. You don't need a deep understanding of AMQP and similar
technologies.

Assuming that you are doing this on a cloud setup, you may already
have access to a RabbitMQ setup.
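
The decoupling described above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard `queue.Queue` stands in for a real broker such as RabbitMQ, and `index_document` is a hypothetical callable representing whatever call sends one document to ElasticSearch. Neither name comes from the original posts.

```python
import queue
import threading


def run_pipeline(documents, index_document, num_consumers=2):
    """Feed documents through an in-process queue to consumer threads.

    index_document is a hypothetical callable that sends one document
    to ElasticSearch; it stands in for whatever client call you use.
    """
    work = queue.Queue(maxsize=1000)  # bounded, so the producer blocks if consumers lag
    stop = object()                   # sentinel telling a consumer to exit

    def consume():
        while True:
            doc = work.get()
            if doc is stop:
                break
            index_document(doc)

    consumers = [threading.Thread(target=consume) for _ in range(num_consumers)]
    for t in consumers:
        t.start()

    for doc in documents:   # producer side: enqueue documents as they are generated
        work.put(doc)
    for _ in consumers:     # one sentinel per consumer
        work.put(stop)
    for t in consumers:
        t.join()


# Usage with a stand-in "indexer" that just records what it receives.
indexed = []  # list.append is atomic in CPython, so this is safe across threads
docs = [{"Name1": "a", "Name2": "b", "Percent": i} for i in range(10)]
run_pipeline(docs, indexed.append)
print(len(indexed))
```

With a real broker the producer and consumers would be separate processes, which is what lets document generation run ahead of indexing.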

Regards,
Mahendra

http://twitter.com/mahendra



(Mahendra M) #4

Hi,

One more tip, along the same lines: you can try out
http://wiki.github.com/jbrisbin/rabbitmq-webhooks/

This can automate the job of listening for messages and indexing to
ElasticSearch.

However, please note that rabbitmq-webhooks is in very early stages of
development (and, as documented, is known to be nasty to RabbitMQ as
of now).

Regards,
Mahendra

In reply to Mahendra M's message of Mon, Aug 30, 2010, 6:29 PM.


(elasticsearcher) #5

In essence, I have a large number of documents which are generated in
large quantities very quickly, and I'm looking for a way to
index them as fast as possible. I was wondering whether there is a way to
index, say, a batch of documents more quickly than indexing each
document individually.

If this isn't something feasible, would it be possible for me to split
off a few threads to help send indexing requests to elasticsearch more
quickly? My question is really aimed at understanding how
elasticsearch deals with indexing requests. Would it be advantageous
to issue as many indexing requests as possible, or would elasticsearch
start to get overloaded? (Or, at what point would elasticsearch get
overloaded?)

For instance, my cluster is currently running on five regular old PCs
(servers coming soon), each with 2 GB RAM (1 GB allocated to ES), a
dual-core Intel CPU, etc. My program, running on each node, generates
lists of, say for simplicity, 1000 documents as fast as it can. After
generating the 1000 documents, it currently submits them one-by-one to
elasticsearch for indexing until they are all gone. It then generates
a new 1000 documents and repeats the process. My program already has
multiple threads which could all be generating sets of 1000 documents
at once, so maybe 3000-4000 documents are queued up at any time on
each node.

Since I'm not very familiar with how ES actually does the indexing,
I'm really just looking for advice on how to get my large number of
documents indexed as quickly as possible.

In reply to Shay Banon's message of Aug 27, 3:52 pm.


(Shay Banon) #6

You should certainly use several threads / processes (/ machines) to index
data. When you index a document, it gets routed to the appropriate shard and
is then replicated to that shard's replicas. Usually, when it comes to
indexing, you can monitor the CPU first and I/O second; if either gets maxed
out, you are over-indexing.
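
As a minimal sketch of the "several threads" advice, assuming a Python client: the example below fans documents out over a thread pool. `index_document` and `fake_index` are hypothetical placeholders for a real per-document indexing call, and the worker count is an arbitrary choice to tune while watching CPU and I/O on the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def index_in_parallel(documents, index_document, workers=4):
    """Send documents using a pool of worker threads.

    index_document is a hypothetical callable that indexes one document
    and returns once ElasticSearch has responded.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order; an exception raised by a
        # call surfaces when its result is retrieved from the iterator.
        return list(pool.map(index_document, documents))


def fake_index(doc):
    # Placeholder for a real HTTP call; returns a made-up result.
    return {"ok": True, "Percent": doc["Percent"]}


docs = [{"Name1": "a", "Name2": "b", "Percent": i} for i in range(1000)]
results = index_in_parallel(docs, fake_index, workers=4)
print(len(results))
```

Threads suit this case because each call spends most of its time waiting on the network rather than using the CPU.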

-shay.banon

In reply to elasticsearcher's message of Wed, Sep 1, 2010, 9:05 PM.


(Berkay Mollamustafaoglu-2) #7

If you use the async (non-blocking) interface, you can index really fast
even if you're sending the docs one by one; a batch process is not really
needed.
If you'll have 5 servers, I'd guess that 3-4K documents would not be an
issue; ES would easily keep up with that. We're able to index 1000 docs with
100 fields each on a single 4-core PC.

You can watch CPU/IO to throttle if necessary. Also, you may want to use
the blocking thread pool:
http://www.elasticsearch.com/docs/elasticsearch/modules/threadpool/blocking/
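
The idea of sending documents one by one without blocking, while still throttling, might look something like this in Python. It is a generic `asyncio` sketch, not the ElasticSearch client's own async interface: `index_document` and `fake_index` are hypothetical coroutines, and the `max_in_flight` cap is an assumed client-side throttle.

```python
import asyncio


async def index_all(documents, index_document, max_in_flight=50):
    """Issue indexing calls concurrently, capped at max_in_flight.

    index_document is a hypothetical coroutine function that sends
    one document without blocking the event loop.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def send(doc):
        async with gate:  # waits here once max_in_flight calls are pending
            return await index_document(doc)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(send(d) for d in documents))


async def fake_index(doc):
    await asyncio.sleep(0)  # placeholder for a non-blocking HTTP request
    return doc["Percent"]


docs = [{"Name1": "a", "Name2": "b", "Percent": i} for i in range(200)]
results = asyncio.run(index_all(docs, fake_index, max_in_flight=20))
print(len(results))
```

Lowering `max_in_flight` is one way to back off if the cluster's CPU or I/O is maxed out.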

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


