Custom components

Hi,
I've been reviewing and analyzing Elasticsearch for a project, and from
what I've seen so far it looks really great!

I have a question regarding custom (component) development inside
Elasticsearch to satisfy some specific requirements.

First, we need to be able to execute batch indexing (using the ES Java
API) in response to an HTTP request (the location of the folder
containing the data to index would be a request parameter). What would
be the best (easiest) way to implement this (by extending ...)?

Second, we need auditing (what was changed in the index, and when), or
maybe even all versions of the 'original JSON document' stored. I know
that an update of a Lucene document is actually an insert (new) plus a
delete (old) against the Lucene index. So probably the simplest way of
keeping all versions of the 'original JSON document' (in the _source
field) would be to simply insert a new document and not delete the old
version (using a custom non-ES ID as a non-unique docID and tagging the
newly inserted document with an ACTIVE tag/flag).
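
To illustrate, here is a rough sketch of what I have in mind against the
Java client (the index/type names and the myId/active/updatedAt fields
are just examples I made up, and the query-builder packages differ
between ES versions):

import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class VersionedIndexing {

    // Insert a new version of a logical document without deleting the
    // old ones: the ES-internal ID is auto-generated, while "myId" is
    // the non-unique business key shared by all versions. (The
    // previously ACTIVE version would still need to be reindexed with
    // active=false, since an update is really a delete plus an insert.)
    public static void indexNewVersion(Client client, String myId, String title) throws Exception {
        client.prepareIndex("docs", "doc") // no explicit ID: ES generates one
                .setSource(jsonBuilder()
                        .startObject()
                        .field("myId", myId)
                        .field("title", title)
                        .field("active", true)
                        .field("updatedAt", System.currentTimeMillis())
                        .endObject())
                .execute().actionGet();
    }

    // 'Active' search filters on the ACTIVE flag, so only the latest
    // version of each document is visible.
    public static void searchActive(Client client, String title) {
        client.prepareSearch("docs")
                .setQuery(QueryBuilders.boolQuery()
                        .must(QueryBuilders.termQuery("title", title))
                        .must(QueryBuilders.termQuery("active", true)))
                .execute().actionGet();
    }
}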

Of course, this approach would create multiple instances of each
document in the index (one per update), so the index would be much
bigger and harder to manage, and it would hurt search performance.

Here is my question: would it be possible to implement this
non-delete-on-update logic in a custom gateway?

For active indexing/search everything would work as usual, and the
active index would be replicated through the 'standard gateway' as
usual, but there would be an 'audit/historical gateway' which would only
insert documents on update (and not delete the old versions). I
understand that this requires a different unique-ID policy on the active
index vs. the 'audit/historical gateway' index.

Maybe it would be possible to re-use ES's transaction log to 'replay'
operations onto another index with a different unique-ID policy?

Also, if possible, not all fields from the active index need to be
replicated in the audit/historical 'all versions stored' index.

Is there a simple way of implementing 'all versions of a document'
history in Elasticsearch?

Tomislav

Hi,
just one clarification on the first question:

the question is not about how to use the ES Java API for batch indexing
(I've found info/code on this in the documentation). I'm interested in
what needs to be done (extended) to add a custom 'web' HTTP/JSON
command/component which takes input params (HTTP/JSON) and outputs some
(simple) response.

Tomislav


Hi,

Regarding the first question, I am not sure why you would want to
extend elasticsearch to do it. You can create a simple program that uses
the elasticsearch API and exposes a REST endpoint to do what you want.
This is preferable since you are more isolated from elasticsearch
internal changes.
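
Something along these lines would do (a rough sketch only: it uses the
JDK's built-in HttpServer instead of a servlet container; the port,
endpoint path, "folder" parameter, and index/type names are all made
up, and the client API shown is the Java Node client of the time, which
differs between versions):

import com.sun.net.httpserver.HttpServer;
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import java.io.File;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class BulkIndexerEndpoint {

    public static void main(String[] args) throws Exception {
        // Join the cluster as a data-less client node and get a Client.
        Node node = nodeBuilder().client(true).node();
        final Client client = node.client();

        // e.g. GET http://localhost:8090/bulk-index?folder=/data/to/index
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
        server.createContext("/bulk-index", exchange -> {
            // Naive query-string parsing, just for the sketch.
            String query = exchange.getRequestURI().getQuery();
            String folder = query.replaceFirst("^.*folder=", "");

            int indexed = 0;
            File[] files = new File(folder).listFiles((dir, name) -> name.endsWith(".json"));
            if (files != null) {
                for (File f : files) {
                    String json = new String(Files.readAllBytes(f.toPath()), "UTF-8");
                    String id = f.getName().replace(".json", ""); // file name doubles as doc ID
                    client.prepareIndex("docs", "doc", id).setSource(json).execute().actionGet();
                    indexed++;
                }
            }

            byte[] body = ("{\"indexed\":" + indexed + "}").getBytes("UTF-8");
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}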

Regarding the second question, again, I think you are trying to solve
this too much "inside" elasticsearch. Where do you plan to store your data?
Maybe you can do your versioning there, and always keep only the latest docs
in elasticsearch.

-shay.banon


Hi,

On Mon, 2010-07-12 at 21:06 +0300, Shay Banon wrote:

> Regarding the first question, I am not sure why you would want to
> extend elasticsearch to do it. You can create a simple program that
> uses the elasticsearch API and exposes a REST endpoint to do what you
> want. This is preferable since you are more isolated from
> elasticsearch internal changes.

Yes, this was my initial idea: to implement a servlet and deploy it on
Tomcat/Jetty (since this bulk/batch indexing application needs to be
triggered over HTTP). The servlet would use the ES Java API for bulk
indexing and bulk updates.

But since ES already runs its own Java server (a servlet container of
some kind?), I thought maybe it would be simpler to integrate this
custom servlet into the ES web server/container.

Is this possible, or would you recommend a separate Tomcat/Jetty with
the 'bulk indexer' servlet?

> Regarding the second question, again, I think you are trying to
> solve this too much "inside" elasticsearch. Where do you plan to
> store your data?

The data will be replicated through the gateway to shared disk (NAS).
But we would have two index versions: one active index with a single
instance of each document (used for searching), and one 'historical'
index with multiple versions of each document (each update to a document
creates a new version; this index would be used for 'historical'
search).
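
For example, a 'historical' search for all versions of one document
could look something like this (re-using the made-up myId/updatedAt
fields and index name from the sketch in my first mail):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortOrder;

public class HistoricalSearch {

    // All stored versions of one logical document, newest first, looked
    // up by the shared non-unique business key.
    public static SearchResponse allVersions(Client client, String myId) {
        return client.prepareSearch("docs-history")
                .setQuery(QueryBuilders.termQuery("myId", myId))
                .addSort("updatedAt", SortOrder.DESC)
                .execute().actionGet();
    }
}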

> Maybe you can do your versioning there, and always keep only the
> latest docs in elasticsearch.

Yes, I think you are right. I guess I can re-post each document that is
posted to the 'active ES' for indexing/update to a 'historical ES'
instance with a different unique-key policy (where an update will be
treated as a new document insert, because of the different unique key
defined in the type mapping).
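
Something like this dual posting, I suppose (index names and fields
made up, error handling omitted):

import org.elasticsearch.client.Client;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class DualWriter {

    // Every index/update goes to both indexes: the active index is keyed
    // by the business ID (so an update overwrites the previous version),
    // while the historical index gets an auto-generated ID (so an update
    // simply adds one more version).
    public static void indexBoth(Client client, String myId, String json) throws Exception {
        // Active index: fixed ID -> an update replaces the old version.
        client.prepareIndex("docs", "doc", myId)
                .setSource(json)
                .execute().actionGet();

        // Historical index: no ID -> ES generates one, old versions stay.
        client.prepareIndex("docs-history", "doc")
                .setSource(jsonBuilder()
                        .startObject()
                        .field("myId", myId)
                        .field("updatedAt", System.currentTimeMillis())
                        .field("doc", json) // keep the original JSON as-is
                        .endObject())
                .execute().actionGet();
    }
}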

Tomislav

Hey, answers below:

On Tue, Jul 13, 2010 at 1:49 PM, Tomislav Poljak tpoljak@gmail.com wrote:

> But since ES already runs its own Java server (a servlet container of
> some kind?), I thought maybe it would be simpler to integrate this
> custom servlet into the ES web server/container.
>
> Is this possible, or would you recommend a separate Tomcat/Jetty with
> the 'bulk indexer' servlet?

I would recommend doing it externally to elasticsearch. ES does not use
Jetty or Tomcat, since they do not implement non-blocking / async IO for
operations very well. It has a nice abstraction for plugging in your own
REST services and so on, but it is not aimed at hosting your code,
except as extensions to ES.

> Yes, I think you are right. I guess I can re-post each document that
> is posted to the 'active ES' for indexing/update to a 'historical ES'
> instance with a different unique-key policy (where an update will be
> treated as a new document insert, because of the different unique key
> defined in the type mapping).

Makes sense.
