Hi,
On Mon, 2010-07-12 at 21:06 +0300, Shay Banon wrote:
Hi,
Regarding the first question, I am not sure why you would want to
extend elasticsearch to do it. You can create a simple program that
uses elasticsearch API and exposes a REST endpoint to do what you
want. This is preferable since you are more isolated from
elasticsearch internal changes.
Yes, this was my initial idea: implement a servlet and deploy it on
tomcat/jetty (since this bulk/batch indexing application needs to be
'executed' over http). This servlet would use the ES Java API for bulk
indexing and bulk updates.
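Loosely, the folder-driven part of that servlet could look like the pure-Java sketch below. `BulkFolderReader` and the one-JSON-file-per-document layout are assumptions made up for illustration; the actual indexing call through the ES Java client is only indicated in a comment:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the 'bulk indexer' servlet: given the folder passed
// as an HTTP request parameter, collect each *.json file as an (id, source)
// pair that the servlet would then feed to the ES Java API, either one index
// request per entry or a single bulk request.
public class BulkFolderReader {

    public static Map<String, String> collectDocs(Path folder) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*.json")) {
            for (Path file : stream) {
                // Use the file name (minus the .json extension) as the document id.
                String id = file.getFileName().toString().replaceFirst("\\.json$", "");
                docs.put(id, new String(Files.readAllBytes(file)));
            }
        }
        // The servlet would now loop over docs and issue an index request per
        // entry through the ES Java client (details depend on the ES version).
        return docs;
    }
}
```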
But, since ES already runs its own java server (a servlet container of
some kind?), I thought maybe it would be simpler to 'integrate' this
custom servlet into the ES web server/container.
Is this possible or would you recommend separate tomcat/jetty with 'bulk
indexer' servlet?
Regarding the second question, again, I think you are trying to
solve this too much "inside" elasticsearch. Where do you plan to store
your data?
Data will be replicated through the gateway on a shared disk (NAS). But,
we would have 2 index versions: one active with a 'single' document
instance (used for searching) and one 'historical' with multiple
document versions (each update on a document creates a new version; this
version would be used for 'historical' search).
Maybe you can do your versioning there, and always keep only the
latest docs in elasticsearch.
Yes, I think you are right. I guess I can re-post each document which is
posted to the 'active ES' instance for indexing/update to the
'historical ES' instance with a different unique key policy (where an
update will be treated as a new document insert, because of the
different unique key defined in the type mapping).
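A minimal sketch of such a key policy, in plain Java. `HistoricalIdPolicy` is a name invented here; the real mapping just needs any scheme that makes every repost unique, e.g. a version counter or a timestamp suffix:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical ID policy for the 'historical' index: each repost of the same
// logical document gets a fresh unique ID, so the index treats an update as a
// brand new insert and the older versions stay searchable.
public class HistoricalIdPolicy {

    private final Map<String, Integer> versions = new HashMap<>();

    // Returns e.g. "doc1_v1", "doc1_v2", ... for successive posts of "doc1".
    public String nextId(String baseId) {
        int version = versions.merge(baseId, 1, Integer::sum);
        return baseId + "_v" + version;
    }
}
```

The counter lives in the indexing application, not in ES; a timestamp suffix would avoid keeping any state at the cost of slightly less readable IDs.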
Tomislav
-shay.banon
On Mon, Jul 12, 2010 at 11:33 AM, Tomislav Poljak <tpoljak@gmail.com> wrote:
Hi,
just one clarification on the first question:
the question is not about how to use the ES Java API for batch
indexing (I've found info/code on this in the documentation). I'm
interested in what needs to be done (extended) to add a custom 'web'
http/json command/component which takes input params (http/json) and
outputs some (simple) response.
Tomislav
On Mon, 2010-07-12 at 10:09 +0200, Tomislav Poljak wrote:
> Hi,
> I've been reviewing and analyzing Elastic Search for one project and
> from what I've seen so far it seems really great!
>
> I have a question regarding custom (component) development inside
> Elastic Search to satisfy specific requirements.
>
> First, we need to be able to execute batch indexing (using the ES
> Java API) on an HTTP request (the location of the folder containing
> data for indexing would be a parameter in the request). What would be
> the best (easiest) way to implement this (by extending ...)?
>
> Second, we need auditing (what and when was changed in the index) or
> maybe even all versions of the 'original json document' stored. I
> know an update of a Lucene document is actually an insert (new) and a
> delete (old) against the Lucene index. Also, probably the simplest
> way of having all versions of the 'original json document' stored (in
> a _source field) would be to simply insert a new document and not
> delete the old document version (by using a custom non-ES ID as a
> non-unique docID and tagging the newly inserted document with an
> ACTIVE tag/flag).
>
> Of course, this approach would create multiple instances of each
> document (after each update) in the index, so the index would be much
> bigger and harder to manage + this would be bad for search
> performance.
>
> Here is my question: would it be possible to implement this
> non-delete-on-update logic in a custom gateway?
>
> For active indexing/search everything would work as usual and the
> active index would be replicated through the 'standard gateway' as
> usual, but there would be an 'Audit/Historical gateway' which would
> only insert documents on update (and not delete the old versions). I
> understand that this requires a different unique ID policy on the
> active vs. the 'Audit/Historical gateway' index.
>
> Maybe it would be possible to re-use ES's transaction log to 'replay'
> operations on the other / 'different unique ID policy' index?
>
> Also, if possible, not all fields from the active index need to be
> replicated in the Audit/Historical 'all versions stored' index.
>
> Is there a simple way of implementing 'all versions of document'
> history in Elastic Search?
>
>
> Tomislav
>