Permanent and selective transaction log


(Otis Gospodnetić) #1

Hello,

I'm curious about the Gateway and its transaction log. More
precisely, I am wondering whether one can keep this transaction log
permanently and whether one can configure ES to keep only certain
types of transactions in the log (e.g. keep only doc modifications,
but not additions)?

Use case:
The use case is a system that indexes some data (say static files in
the file system), then allows one to modify indexed documents, but
doesn't propagate those changes to the original data (say those static
files). Such a system is problematic if one has to reindex the
original data. If that has to be done, all changes (which were
applied only directly to the documents in the index) would be gone.

XA log role:

  • If one can keep that transaction log forever, then one could replay
    all document "edits" and restore the previous state of the index.
  • If one can store only updates (or maybe updates + deletions) in the
    transaction log, then only those could be re-applied, and document
    additions can remain in an external indexing application (say an app
    that indexes files from a file system).

Thanks,
Otis

Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


(Shay Banon) #2

Hi Otis,

The aim of the transaction log in elasticsearch is basically to make sure
that once you index data into elasticsearch it will be there, without the
need to perform a "commit" on Lucene each time (which would kill
performance). This means that when a recovery happens, either from the
gateway or from another node, the actual index files are recovered, and then
the transaction log is replayed. When an actual flush (in elasticsearch
lingo) happens, a commit on Lucene is performed and a new transaction log is
created.
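
To make that lifecycle concrete, here is a rough sketch. It is a toy
in-memory model, not elasticsearch's actual implementation: the ToyShard
class, its method names, and the tuple format of the log entries are all
made up for illustration.

```python
class ToyShard:
    """Toy model: a committed store plus a short-lived transaction log."""

    def __init__(self, committed=None, translog=None):
        # Stands in for the committed Lucene index files.
        self.committed = dict(committed or {})
        # Operations accepted since the last flush (hypothetical format).
        self.translog = list(translog or [])
        # Recovery: start from the committed state, then replay the log.
        self.live = dict(self.committed)
        for op in self.translog:
            self._apply(op)

    def _apply(self, op):
        kind, doc_id, source = op
        if kind == "index":
            self.live[doc_id] = source
        elif kind == "delete":
            self.live.pop(doc_id, None)

    def index(self, doc_id, source):
        op = ("index", doc_id, source)
        self.translog.append(op)  # logged first, so it survives without a commit
        self._apply(op)

    def delete(self, doc_id):
        op = ("delete", doc_id, None)
        self.translog.append(op)
        self._apply(op)

    def flush(self):
        # "Commit": persist the current state and start a new, empty log.
        self.committed = dict(self.live)
        self.translog = []


shard = ToyShard()
shard.index("1", {"title": "first"})
shard.flush()                           # commit; the log starts over
shard.index("1", {"title": "edited"})   # only this operation is in the log now

# Simulated crash and recovery: committed state plus log replay.
recovered = ToyShard(shard.committed, shard.translog)
```

In this model the log only ever holds the operations since the last flush,
so it cannot double as a permanent history of edits.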

I think that what you are after is a different kind of transaction log,
which has different characteristics from the current elasticsearch
transaction log. Those differences are big enough, I think, that it is
better not to piggyback what you are after on the current ES transaction
log. The current ES transaction log is very simple to implement because of
the narrowly defined task it was created for.

A feature such as the one you are after does make a lot of sense. I believe
that it can be implemented in ES, but it would require a whole new "module",
or it could be implemented on top of ES.

-shay.banon

In reply to Otis (otis.gospodnetic@gmail.com), Thu, Jun 10, 2010 at 11:02 PM.




(Lukáš Vlček) #3

I think what Otis is after is already doable. How about the following:
instead of keeping certain types of operations in a permanent transaction
log, it should be possible to index them as specific documents in a new
index. Because elasticsearch keeps the source of documents directly in the
index (by default), it should be possible to use this data when reindexing
the source documents. This would result in three indices:
a) original documents - can be reindexed from source
b) update operations - can be permanently stored via gateway
c) "merge" of a) and b) - only this index is used for search by users.

It shouldn't be hard to implement this with the current API, and one does
not have to bother with transaction logs.
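
A minimal sketch of how index c) could be rebuilt from a) and b). It uses
plain Python dicts in place of real indices, and the operation format (the
"seq", "op", "id" and "changes" fields) is an assumption made up for this
example, not an elasticsearch API.

```python
def build_merged(originals, update_ops):
    """Rebuild the merged view (index c) from originals (a) and logged edits (b)."""
    # Copy each source so the originals stay untouched.
    merged = {doc_id: dict(source) for doc_id, source in originals.items()}
    # Replay the stored operations in the order they were made.
    for op in sorted(update_ops, key=lambda o: o["seq"]):
        doc_id = op["id"]
        if op["op"] == "delete":
            merged.pop(doc_id, None)
        elif op["op"] == "update" and doc_id in merged:
            merged[doc_id].update(op["changes"])
    return merged


# a) documents freshly reindexed from the file system
originals = {
    "a.txt": {"body": "hello", "tags": []},
    "b.txt": {"body": "world", "tags": []},
}
# b) edits that were stored as documents in their own index
update_ops = [
    {"seq": 2, "op": "delete", "id": "b.txt"},
    {"seq": 1, "op": "update", "id": "a.txt", "changes": {"tags": ["reviewed"]}},
]
# c) what users would search
merged = build_merged(originals, update_ops)
```

Updates that refer to a document no longer present in a) are simply skipped
here; a real system would have to decide how to handle that case.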

Regards,
Lukas

In reply to Shay Banon (shay.banon@elasticsearch.com), Thu, Jun 10, 2010 at 10:41 PM.

