On Mon, 2010-07-12 at 21:06 +0300, Shay Banon wrote:
Regarding the first question, I am not sure why you would want to
extend elasticsearch to do it. You can create a simple program that
uses elasticsearch API and exposes a REST endpoint to do what you
want. This is preferable since you are more isolated from
elasticsearch internal changes.
Yes, this was my initial idea: to implement a servlet and deploy it on
tomcat/jetty (since this bulk/batch indexing application needs to be
'executed' over HTTP). This servlet would use the ES Java API for bulk
indexing and bulk updates.
But, since ES already runs its own Java server (a servlet container of
some kind?), I thought maybe it would be simpler to 'integrate' this
custom servlet into the ES web server/container.
Is this possible, or would you recommend a separate tomcat/jetty with a 'bulk
indexing' servlet?
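If the endpoint ends up in a standalone process (tomcat/jetty, or even the JDK's built-in HTTP server), a minimal sketch could look like the following. The endpoint path, port, and 'folder' parameter name are all illustrative, and the actual ES Java API bulk call is only indicated by a comment:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Minimal sketch: an HTTP endpoint that accepts the data folder as a
// query parameter and would delegate to the ES Java API for bulk
// indexing (that call is omitted here).
public class BulkIndexEndpoint {

    // Extract the 'folder' query parameter from a raw query string.
    static String folderParam(String query) {
        if (query == null) return null;
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("folder")) return kv[1];
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
        server.createContext("/bulkIndex", (HttpExchange exchange) -> {
            String folder = folderParam(exchange.getRequestURI().getRawQuery());
            // Here the handler would read the files under 'folder' and
            // submit them to ES via the Java API's bulk support.
            String reply = "indexing requested for: " + folder;
            exchange.sendResponseHeaders(200, reply.length());
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(reply.getBytes());
            }
        });
        server.start();
    }
}
```

The same handler body would translate directly into a servlet's doPost if tomcat/jetty is preferred.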
Regarding the second question, again, I think you are trying to
solve this too much "inside" elasticsearch. Where do you plan to store
the data?
Data will be replicated through the gateway on shared disk (NAS). But, we
would have 2 index versions: one active with 'single' document instance
(used for searching) and one 'historical' with multiple document
versions (each update on document creates a new version, this version
would be used for 'historical' search).
Maybe you can do your versioning there, and always keep only the
latest docs in elasticsearch.
Yes, I think you are right. I guess I can re-post each document which is
posted to the 'active ES' instance for indexing/update to a 'historical ES'
instance with a different unique key policy (where an update will be treated
as a new document insert, because of the different unique key defined in the
type mapping).
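The 'different unique key policy' for the historical instance could be as simple as appending a per-document version counter to the active id, so every re-post becomes a fresh insert rather than an overwrite. A minimal sketch (the id scheme and class name are illustrative, not anything ES provides):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "different unique key" policy: the active index keeps
// the document's natural id, while the historical index appends a
// per-document version counter, so each update lands as a new document.
public class HistoricalKeyPolicy {
    private final Map<String, Integer> versions = new HashMap<>();

    // Return the id to use when re-posting this document to 'historical ES'.
    public String historicalId(String activeId) {
        int v = versions.merge(activeId, 1, Integer::sum);
        return activeId + "_v" + v;
    }
}
```

In practice the counter would have to survive restarts (e.g. be derived from a timestamp or stored alongside the active document), but the key idea is the same.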
On Mon, Jul 12, 2010 at 11:33 AM, Tomislav Poljak <email@example.com> wrote:
Just one clarification on the first question: the
question is not about how to use the ES Java API for batch indexing (I
found info/code on this in the documentation). I'm interested in what needs
to be done (extended) to add a custom 'web' http/json endpoint
which has input params (http/json) and outputs some result.
On Mon, 2010-07-12 at 10:09 +0200, Tomislav Poljak wrote:
> I've been reviewing and analyzing Elastic Search for one of our projects and
> from what I've seen so far it seems really great!
> I have a question regarding custom (component) development in
> Elastic Search to satisfy specific requirements.
> First, we need to be able to execute batch indexing (using the ES Java API)
> on HTTP request (the location of the folder containing data for indexing would
> be a parameter in the request). What would be the best (easiest) way to
> implement this (by extending ...)?
> Second, we need Auditing (what and when was changed in the index) or maybe
> even all versions of the 'original json document' stored. I know that an update
> of a Lucene document is actually an insert (new) and a delete (old)
> against the Lucene index. Also, probably the simplest way of keeping all
> versions of the 'original json document' stored (in a _source field) would
> be to simply insert a new document and not delete the old
> version (by using a custom non-ES ID as a non-unique docID and marking the
> newly inserted document with an ACTIVE tag/flag).
> Of course, this approach would create multiple instances of each
> document (after each update) in the index, so the index would be much bigger and
> harder to manage + this would be bad for search performance.
> Here is my question: would it be possible to implement this
> non-delete-on-update logic in a custom gateway?
> For active indexing/search everything would work as usual and the
> index would be replicated through the 'standard gateway' as usual, but there
> would be an 'Audit/Historical gateway' which would only insert
> on update (and not delete the old versions). I understand this
> requires a different unique ID policy on the active vs. 'Audit/Historical
> gateway' index.
> Maybe it would be possible to re-use ES's transaction log to replay
> operations on the other/'different unique ID policy' index?
> Also, if possible, not all fields from the active index need to be
> replicated in the Audit/Historical 'all versions stored' index.
> Is there a simple way of implementing 'all versions of a document stored'
> in Elastic Search?
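The insert-and-flag scheme described in the quoted question can be modeled in plain Java before committing to an index design. This in-memory sketch (class and field names are illustrative) shows how an update inserts a new version and clears the ACTIVE flag on the old one, so searches filter on the flag while history stays queryable:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model of the non-delete-on-update scheme: a non-unique
// docId shared by all versions, an ACTIVE flag marking the latest.
public class VersionedStore {
    public static class Doc {
        public final String docId;   // custom non-ES, non-unique id
        public final String source;  // the 'original json document'
        public boolean active = true;
        Doc(String docId, String source) { this.docId = docId; this.source = source; }
    }

    private final List<Doc> docs = new ArrayList<>();

    // An "update" never deletes: it flags old versions and inserts a new one.
    public void upsert(String docId, String source) {
        for (Doc d : docs) {
            if (d.docId.equals(docId)) d.active = false; // keep for audit
        }
        docs.add(new Doc(docId, source));
    }

    public long activeCount(String docId) {
        return docs.stream().filter(d -> d.docId.equals(docId) && d.active).count();
    }

    public long totalVersions(String docId) {
        return docs.stream().filter(d -> d.docId.equals(docId)).count();
    }
}
```

Mapped onto ES, upsert would become two operations against the index (flag the previous version, index the new one), which is exactly the extra write cost the question anticipates.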