If I turn off automatic indexing and refreshing, and continually execute
partial updates on the same document (say 100 times), do the updates change
the same record in the transaction log, or do they create 100 separate
entries? I'm asking because when I tell ES to index (or refresh) after a
batch of partial updates, I want to know whether it will index the same
document 100 times or just once. Efficiency is the main concern here.
My data structure is a Customer with lots of Transactions, with each
Transaction containing a date, description, and dollar amount. I would like
to see if a denormalized data structure works here by keeping a list of
transactions on the customer, then adding new transactions to the same
customer record. But this would be very inefficient if the document had to
be reindexed once for every incoming partial update. I'm hoping I can
control this by turning off indexing/refreshing and letting ES update the
same record in the transaction log. I understand that Lucene has immutable
records, but that does not necessarily mean that the transaction log has to
be immutable, right?
Yes, each partial update will record to the transaction log. Whenever the
log is flushed, each update is replayed and the document version is
incremented per update.
Thanks Binh, but I don't think you got the full gist of my question. I
want to avoid reindexing the same document too many times.
What I would like to do is to turn off indexing/refreshing and even
transaction log flushing between the batched partial updates. If I
do turn off all of these mechanisms and send a batch of partial updates to
the same document, then it seems there would be no need to reindex the
document into Lucene segments over and over. The whole batch could
operate on the same document and even increment the version numbers in the
transaction log itself. But I think you're implying that the document
would be reindexed into a Lucene segment per partial update? What I'm
looking for is roughly this sequence of events:
1. document A is indexed and merged into the segment: document VERSION 1
2. turn off all indexing and transaction log flushing
3. send in a batch of changes to document A containing partial updates:
   { A', A'', A''', A'''' }
4. transaction log operates on document A applying the partial updates above
5. modified document A now looks like A'''' and shows document VERSION 5
6. turn on indexing and transaction log flushing
7. document A'''' with version 5 gets merged and indexed into the segment
What I want to achieve is to absorb a lot of incremental updates to a
document in the transaction log without re-indexing per partial update. Is
this possible?
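To make the behaviour being asked about concrete, here is a toy Python sketch. It is purely illustrative: `apply_partial_updates` and the field names are made up for this example, and nothing in it is an Elasticsearch API or a claim about how the transaction log actually works. It only shows the hoped-for outcome, where four partial updates are folded into one copy of document A and the version goes from 1 to 5 before anything is indexed.

```python
def apply_partial_updates(doc, version, partial_updates):
    """Fold a batch of partial updates into one copy of the document.

    Hypothetical helper: models the *desired* behaviour only, where the
    document is rewritten once per batch rather than once per update.
    """
    current = dict(doc)              # work on a copy, leave the input alone
    for partial in partial_updates:
        current.update(partial)      # shallow merge of the changed fields
        version += 1                 # one version bump per partial update
    return current, version


# Document A at VERSION 1, followed by the batch { A', A'', A''', A'''' }.
doc_a = {"name": "Customer A"}
batch = [{"balance": 10}, {"balance": 25}, {"status": "active"}, {"balance": 40}]

final_doc, final_version = apply_partial_updates(doc_a, 1, batch)
print(final_doc, final_version)
```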
Thanks!!
On Wednesday, February 26, 2014 5:52:24 AM UTC-8, Binh Ly wrote:
> Yes each partial update will record to the transaction log. Whenever the
> log is flushed, each update is replayed and the document version is
> incremented per update.
Thanks, I think I understand better now. I deleted my previous post so that
I can clarify better. The transaction log is just a backup mechanism for
durability. When you index a document, it eventually goes into a segment
(in memory). When you update it, the old doc is marked as deleted and then
a new one is indexed into a/the segment. If no flush/commit has been made
so far, the documents/segments are still in memory and each operation is
also recorded in the transaction log (one for the first index, and then
another for the update, and so on). When you do a flush, the in-memory
segments are then written to disk and then the transaction log is emptied
out (since we no longer need it as "backup" at this point). If on the other
hand you simply do a refresh, the "new" segments in memory are simply made
searchable (even though they are not necessarily written to disk yet) and
no flush to disk happens. In this case, the transaction log still contains
whatever it had in it so far.
So to answer your question, each update will require a new document to be
indexed (no way around it). And the transaction log is probably not
something that would matter in your scenario. I hope that helps.
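The explanation above can be sketched as a small toy model in Python. This is only an illustration of the description in this post, under simplifying assumptions: `ToyShard` and all of its fields are hypothetical names, a "segment" is reduced to a plain list of document copies, and none of this is Elasticsearch or Lucene code.

```python
class ToyShard:
    """Toy model: in-memory document copies, 'disk', and a transaction log."""

    def __init__(self):
        self.memory = []    # document copies not yet written to disk
        self.disk = []      # document copies written by a flush
        self.translog = []  # one entry per operation since the last flush

    def _live(self, doc_id):
        # Find the current (not deleted) copy of a document, if any.
        for copy in self.memory + self.disk:
            if copy["id"] == doc_id and not copy["deleted"]:
                return copy
        return None

    def index(self, doc_id, source):
        old = self._live(doc_id)
        version = 1
        if old is not None:
            old["deleted"] = True         # old copy is only marked as deleted
            version = old["version"] + 1
        self.memory.append({"id": doc_id, "version": version,
                            "source": dict(source), "deleted": False,
                            "searchable": False})
        self.translog.append((doc_id, version))  # every operation is logged

    def update(self, doc_id, partial):
        old = self._live(doc_id)
        if old is None:
            raise KeyError(doc_id)
        # A partial update still produces a whole new document copy.
        self.index(doc_id, {**old["source"], **partial})

    def refresh(self):
        # Make in-memory copies searchable; nothing goes to disk,
        # and the transaction log is left untouched.
        for copy in self.memory:
            copy["searchable"] = True

    def flush(self):
        # Write in-memory copies to disk, then empty the transaction log.
        self.disk.extend(self.memory)
        self.memory = []
        self.translog = []


shard = ToyShard()
shard.index("A", {"name": "Customer A"})
for amount in (10, 25, 30, 40):
    shard.update("A", {"balance": amount})
print(len(shard.memory), len(shard.translog))
```

In this model, one index plus four partial updates leaves five document copies in memory (four of them marked deleted) and five transaction log entries, which is the "each update will require a new document to be indexed" point.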
Thanks for the explanation!! I thought that if a record is contained in
the transaction log, it would not be part of a segment, and that as soon as
we flush the transaction log, it re-indexes the changes into the segment and
then commits to disk. But it sounds like a record can be both in the
transaction log and in the Lucene segment itself, in memory.
That sounds believable. I'm trying to come up with a data model that
would be efficient for a Customer record that can have many transactions.
I've ruled out inner objects and nested objects, and am now tinkering with
Parent/Child or complete denormalization.
Thanks again Binh!!
On Wednesday, February 26, 2014 1:46:14 PM UTC-8, Binh Ly wrote:
> Thanks, I think I understand better now. I deleted my previous post so
> that I can clarify better. The transaction log is just a backup mechanism
> for durability. When you index a document, it eventually goes into a
> segment (in memory). When you update it, the old doc is marked as deleted
> and then a new one is indexed into a/the segment. If no flush/commit has
> been made so far, the documents/segments are still in memory and each
> operation is also recorded in the transaction log (one for the first index,
> and then another for the update, and so on). When you do a flush, the
> in-memory segments are then written to disk and then the transaction log is
> emptied out (since we no longer need it as "backup" at this point). If on
> the other hand you simply do a refresh, the "new" segments in memory are
> simply made searchable (even though they are not necessarily written to
> disk yet) and no flush to disk happens. In this case, the transaction log
> still contains whatever it had in it so far.
> So to answer your question, each update will require a new document to be
> indexed (no way around it). And the transaction log is probably not
> something that would matter in your scenario. I hope that helps.