Hey all,
I'm just starting with ES, and I must say it is a beautiful piece of
software!
I will be using it as my main store for a health-related system. The
information that ES will deal with is a bit more sensitive than in other
scenarios, so my main concern is to guarantee its persistence.
To do that, I will additionally store all the data in a Postgres instance.
It will serve as a backup in case (plan C) I have to rebuild the indexes
from scratch, and I also want to take advantage of Postgres's PLV8 language
integration to normalize some JSON data that is more convenient to deal
with in a relational fashion.
Now, I know I could just add the logic to do this right in the app, but:
- I would have to be extra careful to guarantee that both ES and Postgres
have the same info in sync
- It would add overhead to the app and its development, when I only
care about ES access from there
Using a River is not an option, since Postgres is not my main data access
point, nor do I intend it to be.
What I came up with was to create some sort of ES plugin that would catch
all CUD operations straight from the engine and persist them to Postgres
from there. This way I can guarantee that all operations that ES receives
and stores will effectively be backed up to Postgres, so if I keep my ES
operations atomic and consistent from the app, they will be atomic and
consistent in Postgres too.
What I have so far is a plugin that:
- Binds an implementation of IndexingOperationListener
- Catches all operations executed against each primary shard
- Puts them in a transactional queue (currently implemented with
Berkeley DB JE)
- Has a worker thread that constantly reads from the queue in batches
and transactionally updates Postgres
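
To make the queue-and-worker part concrete, here is a minimal sketch of the
idea, not the actual plugin code. It assumes an in-memory BlockingQueue
standing in for the Berkeley DB JE queue (so it is not transactional or
durable) and a plain callback standing in for the Postgres transaction. The
names InnerRiverSketch, Op, BatchWorker and demo are hypothetical, and no ES
classes are used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class InnerRiverSketch {

    /** One captured create/update/delete operation (hypothetical shape). */
    static final class Op {
        final String kind;   // "create", "update" or "delete"
        final String id;     // document id
        final String source; // JSON source, may be null for deletes

        Op(String kind, String id, String source) {
            this.kind = kind;
            this.id = id;
            this.source = source;
        }
    }

    /** Reads operations from the queue in batches and hands each batch to a sink. */
    static final class BatchWorker implements Runnable {
        private final BlockingQueue<Op> queue;
        private final int batchSize;
        private final Consumer<List<Op>> sink; // stand-in for one Postgres transaction
        private volatile boolean running = true;

        BatchWorker(BlockingQueue<Op> queue, int batchSize, Consumer<List<Op>> sink) {
            this.queue = queue;
            this.batchSize = batchSize;
            this.sink = sink;
        }

        /** Asks the worker to finish; it still drains whatever is left in the queue. */
        void stop() {
            running = false;
        }

        @Override
        public void run() {
            try {
                while (running || !queue.isEmpty()) {
                    // Wait briefly for the first element so the stop flag is re-checked.
                    Op first = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (first == null) {
                        continue;
                    }
                    List<Op> batch = new ArrayList<>(batchSize);
                    batch.add(first);
                    // Take whatever else is already queued, up to the batch size.
                    queue.drainTo(batch, batchSize - 1);
                    sink.accept(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and exit
            }
        }
    }

    /** Enqueues n operations, runs the worker to completion, returns the batches it wrote. */
    static List<List<Op>> demo(int n, int batchSize) {
        BlockingQueue<Op> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < n; i++) {
            queue.add(new Op("create", "doc-" + i, "{}"));
        }
        List<List<Op>> written = Collections.synchronizedList(new ArrayList<>());
        BatchWorker worker = new BatchWorker(queue, batchSize, written::add);
        Thread thread = new Thread(worker, "inner-river-worker");
        thread.start();
        worker.stop(); // the worker drains the queue before exiting
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return written;
    }

    public static void main(String[] args) {
        for (List<Op> batch : demo(5, 2)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```

In the real plugin the listener callbacks would be the producers, and the
batch should only be removed from the durable queue after the Postgres
commit succeeds, which this in-memory version does not model.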
So far so good, I'm calling it the InnerRiver and I like the approach. Now,
the questions:
- What do you think about it?
- Is there any tip, advice or anything else I should take into account
that you can think of?
- There are many post-something operation hooks in
IndexingOperationListener. Which ones should I care about, and which
ones should I not? (I especially want to know the difference between the
lock and no-lock events.)
- Any memory concerns I should be thinking of?
- Some better idea for the transactional queue?
- Is there an elegant way inside ES to execute worker threads that
adjust to the node's lifecycle?
Okay, that's a lot of questions (for now! :-)). If you guys like this idea,
I will gladly contribute it to the community!
Thanks a lot in advance!
Cheers,
Nicolas.