Details of how transaction log is managed during indexing


(Amit Soni) #1

Hi - I am curious to know the specific detail (sequence of steps involved)
when an indexing request is sent to elastic search and how transaction logs
are managed? It would be great if anyone can point me to a right article
which covers this stuff in good level of details.

For instance, a few questions that I am hoping to get answered are:

  1. In the default (quorum) mode, is it the master node which determines
    which of the shards have to be part of the quorum? Or is this figured out
    by the elastic search node client (which is aware of cluster state and sits
    in the client application code)
  2. Once the shards are identified where write would happen, is the
    request returned with OK status soon after the stuff is written to
    transaction log?
  3. Is transaction log maintained at shard level (each shard has its own
    transaction log entry)
  4. Is it a good practice to store transaction log and index data in
    separate storage, just to be safe? If so, what property controls this
    behavior?

-Amit.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAAOGaQ%2B088fDmR72V7TORD%2BLw5mmD5bMcASMWuN-gfw_x5msgg%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

Hey Amit!

Some answers inlined.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 11 déc. 2013 à 04:35, Amit Soni amitsoni29@gmail.com a écrit :

Hi - I am curious to know the specific detail (sequence of steps involved) when an indexing request is sent to elastic search and how transaction logs are managed? It would be great if anyone can point me to a right article which covers this stuff in good level of details.

For instance, a few questions that I am hoping to get answered are:

In the default (quorum) mode, is it the master node which determines which of the shards have to be part of the quorum? Or is this figured out by the elastic search node client (which is aware of cluster state and sits in the client application code)
When an index request happen, elasticsearch send it to all shards (primary first and then all replicas) whatever the replication mode is.
Elasticsearch won't fail if a minimum of shards (quorum by default) answers OK.
Once the shards are identified where write would happen, is the request returned with OK status soon after the stuff is written to transaction log?
Yes. The primary shard can fail just after sending IndexResponse OK and elasticsearch needs to promote a replica as a primary with a synchronized transaction log.
Is transaction log maintained at shard level (each shard has its own transaction log entry)
Yes.
Is it a good practice to store transaction log and index data in separate storage, just to be safe? If so, what property controls this behavior?
Not sure you can and if it's a best practice. I have never heard about it so I think you should not worry about it.

What others think?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/41C9CD54-2880-48EC-8D53-BDD652E198EF%40pilato.fr.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #3

In the world of traditional database, the transaction log is often kept on
separate devices / disk drives. This is reasonable because if a database
file gets corrupted, the transaction log must be kept in a safe place to
ensure full recovery. The idea is that database files contain saved
transactions and a volatile part which may get corrupted and can be
repaired by replaying the transaction log. The concept is based on the ACID
paradigm.

From what I understand, ES translogs are for communicating with Lucene.
Each Lucene operation is preceded by a write to the translog. By doing
this, request for Lucene operations are persisted and can not be
accidentally dropped. Lucene also does not know about transactional
operations and manages an internal cache in RAM. From time to time, the
Lucene segments are written to disk in append-only fashion (sync
operation). Lucene passes this operations to the JVM and the JVM passes it
to the operating system.

I'm not sure if ES translogs can be used to repair Lucene indexes by
replaying it, similar to the traditional database world. In case of low
disk situations, I have the experience it may succeed, but it does not
always succeed. The situation is often that ES translog is still intact and
Lucene files ran into out of disk space, but it can also occur that both ES
and Lucene tried to flush data to disk in failure, and ES can recover only
after the translog file got removed (at least in older ES versions).

The gateway concept is more similar to transactional processing. At node
startup, the gateway orchestrates the shards of an index and invokes
recovery with help from replica shards in order to recreate a valid index.

My conclusion is: it would not hurt to have a configuration for maintaining
ES translogs on separate disk devices, but I doubt it is worth the effort.
From a performance point of view, under high disk I/O contention, separate
disks might help to enhance disk IOPS, but with SSD, this became a
non-issue. ES translogs alone do not ensure the recoverability of a whole
index. The ES design for separation to recover an index successfully is
different from traditional databases. That is where replica shards on
remote nodes come into play.

Jörg

On Wed, Dec 11, 2013 at 8:34 AM, David Pilato david@pilato.fr wrote:

  1. Is it a good practice to store transaction log and index data in
    separate storage, just to be safe? If so, what property controls this
    behavior?

Not sure you can and if it's a best practice. I have never heard about it
so I think you should not worry about it.

What others think?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHMTvTUNnzLNQ_JqWBoHCJgfsyES7ozKb7FVHr5ggHCmQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jason Wee) #4

Very informative answers. I may have similar question to this, when a node
that hold the primary shard goes down, what is the interval of the replica
shard come online or is there any way to know it?

Thank you.

Jason

On Wed, Dec 11, 2013 at 5:58 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

In the world of traditional database, the transaction log is often kept on
separate devices / disk drives. This is reasonable because if a database
file gets corrupted, the transaction log must be kept in a safe place to
ensure full recovery. The idea is that database files contain saved
transactions and a volatile part which may get corrupted and can be
repaired by replaying the transaction log. The concept is based on the ACID
paradigm.

From what I understand, ES translogs are for communicating with Lucene.
Each Lucene operation is preceded by a write to the translog. By doing
this, request for Lucene operations are persisted and can not be
accidentally dropped. Lucene also does not know about transactional
operations and manages an internal cache in RAM. From time to time, the
Lucene segments are written to disk in append-only fashion (sync
operation). Lucene passes this operations to the JVM and the JVM passes it
to the operating system.

I'm not sure if ES translogs can be used to repair Lucene indexes by
replaying it, similar to the traditional database world. In case of low
disk situations, I have the experience it may succeed, but it does not
always succeed. The situation is often that ES translog is still intact and
Lucene files ran into out of disk space, but it can also occur that both ES
and Lucene tried to flush data to disk in failure, and ES can recover only
after the translog file got removed (at least in older ES versions).

The gateway concept is more similar to transactional processing. At node
startup, the gateway orchestrates the shards of an index and invokes
recovery with help from replica shards in order to recreate a valid index.

My conclusion is: it would not hurt to have a configuration for
maintaining ES translogs on separate disk devices, but I doubt it is worth
the effort. From a performance point of view, under high disk I/O
contention, separate disks might help to enhance disk IOPS, but with SSD,
this became a non-issue. ES translogs alone do not ensure the
recoverability of a whole index. The ES design for separation to recover an
index successfully is different from traditional databases. That is where
replica shards on remote nodes come into play.

Jörg

On Wed, Dec 11, 2013 at 8:34 AM, David Pilato david@pilato.fr wrote:

  1. Is it a good practice to store transaction log and index data in
    separate storage, just to be safe? If so, what property controls this
    behavior?

Not sure you can and if it's a best practice. I have never heard about it
so I think you should not worry about it.

What others think?

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHMTvTUNnzLNQ_JqWBoHCJgfsyES7ozKb7FVHr5ggHCmQ%40mail.gmail.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHO4itwrDsiH2vLA-mZHc0%2BpCCbFPQcDMQS84qDsfRz-n2vfuw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Nik Everett) #5

On Wed, Dec 11, 2013 at 4:58 AM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

In the world of traditional database, the transaction log is often kept on
separate devices / disk drives. This is reasonable because if a database
file gets corrupted, the transaction log must be kept in a safe place to
ensure full recovery. The idea is that database files contain saved
transactions and a volatile part which may get corrupted and can be
repaired by replaying the transaction log. The concept is based on the ACID
paradigm.

Normally you actually ship them off system if you really want recovery.
I'll describe what I know about PostgreSQL:

  1. Once the data files are successfully synchronized (think fsync) then
    the transaction log containing those operations can usually be recycled.
  2. If you want the ability to recover from some disaster (think meteor)
    what you do is bundle those logs when it is time to recycle them and ship
    them off site (think gzip and scp).
  3. You can replay those logs on top of a restored backup to get a
    reasonably recent copy of the database. It is also possible to replay only
    a portion of the logs to restore a point in time.
  4. If there is a problem before the data files are synced (think power
    failure) then the transaction logs can be replayed on top of the data files
    during server start up. The transaction logs are synced by definition of
    what a transaction is (unless you've relaxed it with settings).
  5. In the world of raid you aren't likely to lose the whole
    disk, but corrupting a file is totally possible if you write to it and lose
    power at the same time. If this happens while writing to the transaction
    log then the transaction wasn't committed. If this happens while writing
    to the data files the transaction log will recover the write on startup.

Normally moving transaction logs to another disk is a performance thing -
but only in very specific disk layouts. The thing is that the transaction
log is always written forwards while the data is written randomly and being
read as well. Too few disks and it isn't worth doing because you'll waste
2 whole disks on the transaction log. Too many and you've probably already
bought a batter backed raid card or SAN or something which smooths out the
always forward writes to the point where you won't see any benefit for the
trouble of partitioning. The battery backed raid card or SAN should be
able to recover from a power failure by replying its write ahead cache on
startup, thus keeping anything from getting corrupted.

My conclusion is: it would not hurt to have a configuration for maintaining

ES translogs on separate disk devices, but I doubt it is worth the effort.
From a performance point of view, under high disk I/O contention, separate
disks might help to enhance disk IOPS, but with SSD, this became a
non-issue. ES translogs alone do not ensure the recoverability of a whole
index. The ES design for separation to recover an index successfully is
different from traditional databases. That is where replica shards on
remote nodes come into play.

From a recoverability standpoint I'd rely on restoring actions from a
replica in the case of single node failure and replaying the actions from
another system of record for whole cluster failures.

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd1STO_CN0d6EPOC-CxPd8jCBnrVTOfhNuP%3D%2BffsfjAFfg%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #6

The switchover from primary to replica shards is seamless for each node,
there is no interval, the local cluster state routing table holds the info
for all known shards of the cluster.

If a node goes down and a shard of that node is accessed, this raises an
error, and the failed access is rerouted to another node by the help of the
local cluster state routing table.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoG9tXT9QgA%3D1gfZNvOr3HF%3DaZtv-zzt0yVYBwEe7OaaUQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Anantha Govindarajan) #7

Words from shay.banon regarding transaction logs in ElasticSearch.

kindly refer
: http://elasticsearch-users.115913.n3.nabble.com/Permanent-and-selective-transaction-log-td886751.html

The aim of the transaction log in elasticsearch is to basically make sure
that once you index data into elasticsearch it will be there, without the
need to perform a "commit" on lucene each time (which will kill
performance). This means that when a recovery happens, either from gateway
or from another node, the actual index files are recovered, and then the
transaction log is replayed. When an actual flush (in elasticsearch lingo)
happens, a commit on Lucene is performed and a new transaction log is
created.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d610b5c6-9913-498f-820c-f8e45eefa5b7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(system) #8