Hi,
rivers are created as instances, and each river instance runs as a
cluster-wide singleton.
It is up to the river how indexing is executed.
Note that bulk updates of existing documents are not implemented in ES.
You can mix create, index (insert or overwrite), and delete operations
in bulk mode. To update existing documents in bulk mode, the bulk
indexer would need an additional notion of how to order incoming
requests (it receives operation requests from many concurrent sources,
so something like vector clocks would be required), and that is not
there yet.
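To illustrate what mixing operations in bulk mode looks like, here is a
minimal sketch that builds a newline-delimited bulk request body with one
create, one index, and one delete action. The index name "customers", the
type "profile", and the helper build_bulk_body are made up for this
example; only the action/source line layout comes from the bulk API.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, meta, source) tuples into a bulk request body.

    Each action contributes one metadata line; create and index are
    followed by a source line, delete is not.
    """
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}, sort_keys=True))
        if action != "delete":
            lines.append(json.dumps(source, sort_keys=True))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical index/type names and documents, for illustration only.
ops = [
    ("create", {"_index": "customers", "_type": "profile", "_id": "1"},
     {"name": "Alice", "visits": 1}),
    ("index", {"_index": "customers", "_type": "profile", "_id": "2"},
     {"name": "Bob", "visits": 7}),
    ("delete", {"_index": "customers", "_type": "profile", "_id": "3"},
     None),
]

body = build_bulk_body(ops)
```

The resulting string is what you would send as the body of a single bulk
request. Note that there is no "update" action in this sketch, which is
the limitation described above.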
I do not understand what "better performance" means exactly. Can you
specify what kind of performance you are interested in? Less system
load? Faster ingest for the baseline load? Lower latency? Higher
indexing throughput?
If the input can be segmented (RabbitMQ comes with the notion of
channels), you can run many rivers in parallel. You can distribute river
instances over many nodes: just install a river instance for a channel
on the node you want.
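One way to segment the input is to partition messages by customer ID on
the producer side, so that each river instance reads a disjoint set of
customers and updates for one customer stay in order. A rough sketch,
assuming one queue per river instance; the queue name pattern
"customer-updates-N" and the function queue_for are hypothetical, not
part of RabbitMQ or the river plugin:

```python
import zlib


def queue_for(customer_id, num_queues):
    """Map a customer ID to one of num_queues queue names.

    CRC32 is used because it gives the same result in every process,
    so all producers agree on the target queue for a given customer.
    """
    bucket = zlib.crc32(customer_id.encode("utf-8")) % num_queues
    return "customer-updates-%d" % bucket


# Group a few example customer IDs by the queue they would be sent to.
assignments = {}
for cid in ["1001", "1002", "1003", "1004", "1005", "1006"]:
    assignments.setdefault(queue_for(cid, 4), []).append(cid)
```

Each river instance would then consume exactly one of these queues. The
number of queues (4 here) is arbitrary and would be chosen to match the
number of river instances you want to run.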
In many cases you don't need to increase the "machine size" (assuming
you mean RAM and disk size). I don't know what size you have in mind,
so there is no good answer. Note that ES scales horizontally very well;
that is, instead of one big machine you can use several smaller
machines.
If you increase the number of shards, you may have better distribution
of load, but a bit slower overall indexing (which is acceptable).
You mention indexing_buffer. What do you mean by that? Do you mean
"max_shard_index_buffer_size"?
Best regards,
Jörg
On 26.02.13 06:24, rockbobsta wrote:
Hi,
Any input or suggestions for the following would be welcome:
I am working on a project which aims to store some dynamic data about
customers within Elasticsearch.
The idea is that we can keep a profile of customer actions and be able
to search easily on various attributes of the customers.
The customer profile will be updated based on messages coming into a
RabbitMQ queue.
So the approach we are currently taking is to modify the RabbitMQRiver
plugin and instead of doing bulk updates, performing upserts on the
customers based on their ID.
The query load on ES will not be particularly high, so we have tried
to optimise the cluster (currently only 2 nodes) for indexing rather
than querying performance.
Does this sound like a reasonable approach? At this stage our river is
processing around 300-500 messages/s from the RabbitMQ queue.
It appears that a River runs as a singleton within only one node of
the cluster. With that in mind, would it be possible to get better
performance by having multiple workers listening on the RabbitMQ queue
and individually executing upserts to the ES cluster?
Additionally, as the River only runs on a single node, does this imply
that to scale up and process more messages, the best option is to
increase the machine size?
We have tried the following to boost the speed of the processing as well:
- increased default shards to 20
- increased the indexing_buffer to 20%
thanks for any advice
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.