Let me do a quick brain dump here, and try to explain what needs to be done
to properly support this:
First, one can (with my help, or looking at the code) write a plugin that
registers for indexing operations. The listener can also check and make sure
to only process events that happen on a primary shard (so they won't be
processed on the replica, if a design requires it). But, to be honest, this
is the easy part.
For changes feed, one has two options, pull and push.
Lets start with push. Push notifications are quire simple to implement in a
non distributed solution (like redis does). People register listeners and
every time an operation happens, the listeners get notified. It does require
some thought as to how to publish those events. If one controls the clients
as well, then its simple (i.e. doing it only for the Java API), but, since
elasticsearch treats HTTP as a first class citizen, a solution for HTTP
needs to be built as well. This can be similar to pusspubsub...(
PubSubHubbub · GitHub), but then the clients needs also to
listen for HTTP requests.
Also, with push notifications, there is a question of do we only send "new"
events to the listener, or also send all the current data (possibly
filtered) and later, new events happening. There is a question of what to do
with misbehaving endpoints that don't process notifications fast enough
(tricky to identify it...), block them, drop them, or something similar.
Also, there is a question if the listeners are persistent. If they go away,
do we queue events and send it to them once they reconnect?
Now, lets move to a distributed solution. Lets start with simple HA,
replication. Now, we need to make sure that listeners registrations are
persisted across the cluster (and possibly surviving full cluster restart).
Also, we need to make sure as shards move around that those listeners move
around with them. Also, if we support persistent notifications, we need the
queue of future events that need to be sent to disconnected clients is
replicated as well (and we need to recover them, get this data into hot
relocation of shards, and so on).
Now, lets talk about pull notifications, which is similar to how couchdb
does things. First, a note on couchdb. The data structure couchdb has
(basically, a never ending (up to compaction) btree) is a big boon when it
comes to implementing pull notifications. elasticsearch/lucene do not work
like that.
Pull notification will probably require API based invocation of give me
changes since X. X can be a timestamp, or an id that denotes some sort of
"timeability"/order. A user will need to register the fact that it starts
listener, and we in elasticsearch can make sure that any changes are kept
around for the next pull request the user does (either on an open HTTP
connection, or per request, does not really matter). This is a bit simpler
to implement in elasticsearch, we can keep the transaction log around long
enough till we notified all clients about the changes, and, it allows us to
do async notifications more easily. But, it still requires delicate control
over the transaction log and when we can safely "get rid" of it.
Also, pull notification require thought as to how to provide all the
"current" data in elasticsearch, Again, its certainly possible, and the user
can provide a query that will filter that data if not all data is needed.
In terms of the internals of how elasticsearch works, pull notification is
simpler, but still require delicate work when it comes to concurrency,
transaction log handling, that are pretty low level... . Not simple.
Summary:
One of the things left on the plate for elasticsearch is cross data center
replication. I would love to implement it in a way that cross data center
replication mechanism is open enough for users to use. What does it mean?
For example, if we do pull based notifications, we can possibly utilize that
for cross data center replication. Another cluster, halfway around the
world, is just another user of the pull based notifications.
Hope things make a bit more sense now...
On Sat, Aug 13, 2011 at 8:30 PM, David Richardson <
david.richardson@enquora.com> wrote:
How then would one push change events into rabbitmq, or some other message
broker. Not my preferred mechanism, since rabbitmq isn't distributed, but
perhaps that isn't so important for change events. Soft realtime required,
polling not allowed. Doing this "in the (external) app" isn't viable.
Change notifications and WAN replication are really the only things missing
in es that preclude decommissioning our couchdb infrastructure - which we
would like to do since virtually every query against must already go through
external search. Getting change events into an external message broker
provides an immediate solution to both, but perhaps that's no easier than an
internal changes feed.
btw, Postgresql provides an even better model for external notificationshttp://www.postgresql.org/docs/9.0/interactive/sql-notify.htmlimho - multiple channels plus a programmable payload. Have no experience
with it at extreme load, but under moderate load it works wonderfully.
Again, radically different technical environment - it's the api model that's
of interest. What we're talking about for ES is a river producer, rather
than consumer.
cheers,
d.r.