Hi Shay, would you mind giving some pointers on how exactly one should
create a plugin to register for indexing operations? Where can I find the
hooks?
Regards
On Saturday, August 13, 2011 3:35:24 PM UTC-4, kimchy wrote:
Let me do a quick brain dump here, and try to explain what needs to be
done to properly support this:First, one can (with my help, or looking at the code) write a plugin that
registers for indexing operations. The listener can also check and make
sure to only process events that happen on a primary shard (so they won't
be processed on the replica, if a design requires it). But, to be honest,
this is the easy part.For changes feed, one has two options, pull and push.
Lets start with push. Push notifications are quire simple to implement in
a non distributed solution (like redis does). People register listeners and
every time an operation happens, the listeners get notified. It does
require some thought as to how to publish those events. If one controls the
clients as well, then its simple (i.e. doing it only for the Java API),
but, since elasticsearch treats HTTP as a first class citizen, a solution
for HTTP needs to be built as well. This can be similar to pusspubsub...(
PubSubHubbub · GitHub), but then the clients needs also
to listen for HTTP requests.Also, with push notifications, there is a question of do we only send
"new" events to the listener, or also send all the current data (possibly
filtered) and later, new events happening. There is a question of what to
do with misbehaving endpoints that don't process notifications fast enough
(tricky to identify it...), block them, drop them, or something similar.Also, there is a question if the listeners are persistent. If they go
away, do we queue events and send it to them once they reconnect?Now, lets move to a distributed solution. Lets start with simple HA,
replication. Now, we need to make sure that listeners registrations are
persisted across the cluster (and possibly surviving full cluster restart).
Also, we need to make sure as shards move around that those listeners move
around with them. Also, if we support persistent notifications, we need the
queue of future events that need to be sent to disconnected clients is
replicated as well (and we need to recover them, get this data into hot
relocation of shards, and so on).Now, lets talk about pull notifications, which is similar to how couchdb
does things. First, a note on couchdb. The data structure couchdb has
(basically, a never ending (up to compaction) btree) is a big boon when it
comes to implementing pull notifications. elasticsearch/lucene do not work
like that.Pull notification will probably require API based invocation of give me
changes since X. X can be a timestamp, or an id that denotes some sort of
"timeability"/order. A user will need to register the fact that it starts
listener, and we in elasticsearch can make sure that any changes are kept
around for the next pull request the user does (either on an open HTTP
connection, or per request, does not really matter). This is a bit simpler
to implement in elasticsearch, we can keep the transaction log around long
enough till we notified all clients about the changes, and, it allows us to
do async notifications more easily. But, it still requires delicate control
over the transaction log and when we can safely "get rid" of it.Also, pull notification require thought as to how to provide all the
"current" data in elasticsearch, Again, its certainly possible, and the
user can provide a query that will filter that data if not all data is
needed.In terms of the internals of how elasticsearch works, pull notification is
simpler, but still require delicate work when it comes to concurrency,
transaction log handling, that are pretty low level... . Not simple.Summary:
One of the things left on the plate for elasticsearch is cross data center
replication. I would love to implement it in a way that cross data center
replication mechanism is open enough for users to use. What does it mean?
For example, if we do pull based notifications, we can possibly utilize
that for cross data center replication. Another cluster, halfway around the
world, is just another user of the pull based notifications.Hope things make a bit more sense now...
On Sat, Aug 13, 2011 at 8:30 PM, David Richardson <david.ri...@enquora.com<javascript:>
wrote:
How then would one push change events into rabbitmq, or some other
message broker. Not my preferred mechanism, since rabbitmq isn't
distributed, but perhaps that isn't so important for change events. Soft
realtime required, polling not allowed. Doing this "in the (external) app"
isn't viable.Change notifications and WAN replication are really the only things
missing in es that preclude decommissioning our couchdb infrastructure -
which we would like to do since virtually every query against must already
go through external search. Getting change events into an external message
broker provides an immediate solution to both, but perhaps that's no easier
than an internal changes feed.btw, Postgresql provides an even better model for external notificationshttp://www.postgresql.org/docs/9.0/interactive/sql-notify.htmlimho - multiple channels plus a programmable payload. Have no experience
with it at extreme load, but under moderate load it works wonderfully.
Again, radically different technical environment - it's the api model
that's of interest. What we're talking about for ES is a river producer,
rather than consumer.cheers,
d.r.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.