In terms of an NRT-based index, I was thinking of using a River to have an
Application stack send messages via, say, RabbitMQ to ES with updates. The
pro's for this approach are that updates are stored even if the application
that is making the updates catches fire (no updates are lost, as they are
queued and processed 'eventually', hopefully in NRT ). From looking at the
RabbitMQ river code it looks like a single thread is responsible for
processing the JMS queue, and it does try to do it in a bulk
fashion, coalescing batch messages into larger chunks for efficient updates
to ES. Even better the versioning code ensures any out of order writes are
discarded.
Having said that though, it's still single threaded. In an application
where there are multiple threads (hundreds) producing the updates there's
good chance for overrunning a single threaded queue. In our legacy Lucene
based architecture we used ActiveMQ with multiple threads consuming a queue
to apply updates. That suffers from it's own problems in that there's
eventually contention for updates on specific indexes, so the multiple
threads may not really benefit in practice.
Another obvious other option is to have the application thread itself send
the updates via the Java client (asynchronously as is the default). This
would batter the ES instance with many small updates though and wouldn't
have the luxury of the coalescing strategy the River has. Secondly while it
is asynchronous like the River ensuring the application code is responsive
and not adding latency to user actions, it can suffer from lost updates in
the event the application dies before an update to ES has been sent and
processed. Given a user update action generates an transaction update to
the DB, it's important to try to keep the Index index in sync as close to
possible without the hideousness of having 2PC (gack). With a JMS queue
based approach it means ES becomes sort of eventually-consistent, laggy
maybe, but never lost, as long as the placing of the JMS message in the
queue is part of the same transaction as the DB (although again that's 2PC,
in reality we currently 'live' with a hand-rolled post-db-txn commit hook
that pushes the JMS message, probably no more risky than the simple
async-client approach.
Just wondering on what other strategies people have for a 'high' volume
update volume NRT strategy? To give some context looking at our perf data
right now in peak-ish times we're producing about 10 JMS messages/second
with on average around 15-30 individual 'entity' updates in that payload
(each message may contain multiple entity updates, but a good chunk are just
a small handful, even just 1). I'm sure there's larger volume sites out
there that may be able to comment, I'd love to hear how they're dealing with
this sort of issue.
cheers,
Paul Smith