On Tue, Sep 13, 2011 at 12:02 PM, Per Steffensen email@example.com:
Shay Banon wrote:
Dear god, each question you send is 5000 words, when all of them could
really be one-sentence questions... It's hard to answer those types of
questions, since they are not really questions...
Actually only 460 words (2930 characters). It is not only one question.
There are many questions. I use the commonly accepted marker for questions -
namely the question mark (?) - but you tend to overlook them anyway. The
reason for being so verbose is to make sure not to be misunderstood with
ambiguous one-sentence questions. Despite that, you tend to miss the point in
some of the questions. I will have to be even more verbose in the future.
No, seriously, don't waste any more of your time answering my questions if you
do not think you have the time to read and understand them properly.
You write very long paragraphs with multiple questions in them. By the time
one finishes reading the paragraph, one has forgotten the questions that were
asked at the beginning. You somehow managed to break the paragraph into
smaller parts when answering, so why not do it from the get-go?
The other problem is that you repeat the same questions several times, and
manage to ask a question in a very convoluted manner. Maybe it's the language
barrier, I don't know, but you need to find a way to be more concise. The
fact that you get answers at all on this mailing list (compared to others,
where people will simply say, frack it, I am not going to spend time reading
all of this) is something that you should appreciate.
Writing longer text does not make your questions more understandable. Your
questions are very simple (at least the ones you have asked so far, and of
course there is no problem with asking them). But the amount of words you
put into each one, well, it's strange...
When indexing a document, the document gets indexed in a sync manner on a
shard and its replicas. It is also written to each shard's local transaction
log to make sure it does not get lost.
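The sync write path described above (index on the primary and its replicas, plus a transaction-log append, before the call returns) can be sketched as a toy model. This is purely illustrative Python with invented names, not the actual Elasticsearch internals:

```python
# Toy model of ES's sync write path. All class/function names are
# invented for illustration; this is not the real ES implementation.

class Shard:
    def __init__(self, name):
        self.name = name
        self.translog = []   # append-only transaction log (fsync'd to disk in real ES)
        self.index = {}      # stands in for the Lucene index on this shard copy

    def apply(self, doc_id, doc):
        # The operation goes into the transaction log, so it survives a
        # crash even before any Lucene commit has happened.
        self.translog.append((doc_id, doc))
        self.index[doc_id] = doc

def index_document(primary, replicas, doc_id, doc):
    # With sync replication (the default), the call only returns after
    # the primary AND every replica have applied the operation.
    primary.apply(doc_id, doc)
    for replica in replicas:
        replica.apply(doc_id, doc)
    return "ok"

primary = Shard("shard0-primary")
replicas = [Shard("shard0-replica1"), Shard("shard0-replica2")]
index_document(primary, replicas, "1", {"user": "per"})
# By the time index_document returns, the operation sits in every
# shard copy's transaction log.
```

The point of the toy model is the ordering guarantee: the client call does not return before the operation has been logged on every copy, which is what makes a reply from ES mean "durable on multiple disks" rather than "buffered somewhere".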
With the local gateway (the default, and you should use that), on a full
cluster restart the state of the cluster and the indices will be recovered
based on the data stored on each node.
Thanks. A small comment on why you recommend using the default gateway,
please? We actually planned to use the Hadoop gateway, since we will have
Hadoop running on the machines anyway.
On Mon, Sep 12, 2011 at 12:25 PM, Per Steffensen firstname.lastname@example.org:
Reading about persistence in ES I have a hard time figuring out exactly
how and when it works - node-local storage vs. gateway storage. Do you have a
pointer to a thorough description of how persistence works?
No answer. Will assume that such a thorough description does not exist.
You are asking for the nitty-gritty details of how the actual recovery
works; that's a different question.
In general I want to make sure that data will be "persisted-persisted"
when an indexing-process (some code that I will write, doing a number of
index-operations against ES, maybe bulk indexing) finishes. It must not be
possible for an indexing-process to believe that it has indexed a number of
documents when they actually have not been "persisted-persisted" yet. By
"persisted-persisted" I mean that no data will be lost even if all nodes
in the cluster stop (e.g. due to a global power outage) a split second after
the process finished, or if any single disk crashes a split second
after the process finished. So "persisted-persisted" means stored on disk
(will survive shutdown of the machine) - actually stored on at least two disks
(redundant). I believe I have heard one of the ES guys saying something about
documents not being "persisted-persisted" until IndexWriter.commit (or
something like it) has been called on the underlying Lucene, and that
IndexWriter.commit is called asynchronously when ES sees fit. If that is true,
I guess I need a synchronous way through ES to make sure that this has
happened. I have also heard that this operation is expensive, and that it
should therefore not be done too often. I need to make sure that it has been
done when my indexing-process finishes (by calling the operation as the last
thing in the process), but if it is expensive I guess I need to make sure that
my indexing-processes are not too small with respect to the number of
documents that they index. Any comments on that?
No comments on async persistence (the calling of IndexWriter.commit). Will
assume there is no such thing happening, even though I believe I heard it
mentioned in the Berlin conference talk.
It is "persisted-persisted". When you index a document, it's there, safely
written to a transaction log (so no need to call IndexWriter#commit), and
replicated (in a sync manner by default) to all the shard replicas.
What would be a practical lower limit on the number of index-operations that
have to be done between IndexWriter.commits?
No answer, but not relevant if no async persistence is going on.
As I understand it, some information will be persisted "locally" on the
nodes and some information will be persisted in the gateway. Exactly what
kind of information will be persisted "locally" on the nodes and what kind
of information will be persisted in the gateway?
No answer. I still have a problem understanding local persistence vs.
gateway persistence. Maybe there is no such thing as local persistence (except
when the gateway is the default local one), even though I would assume that
the Lucene index itself is persisted locally. I will make my own tests and
read the code to understand.
The local gateway can recover both the cluster state (which indices were
created, mappings) and the indices data from each node's local storage. It
uses the locally stored indices data to recover itself, and specially
placed files for the cluster metadata.
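That recovery story (cluster metadata from specially placed files on disk, index data from each node's local storage, transaction log replayed into the index) could be sketched as a toy model. Again, every name here is invented for illustration; this is not the real recovery code:

```python
# Toy sketch of local-gateway recovery; file names and layout are
# hypothetical, not the actual on-disk format ES uses.
import json
import os
import tempfile

def write_node_state(node_dir, metadata, translog_ops):
    # The local gateway keeps cluster metadata in specially placed files
    # alongside the index/translog data on each node.
    os.makedirs(node_dir, exist_ok=True)
    with open(os.path.join(node_dir, "metadata.json"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(node_dir, "translog.json"), "w") as f:
        json.dump(translog_ops, f)

def recover(node_dir):
    # On a full cluster restart, each node recovers from its own disk:
    # read the cluster metadata, then replay the translog into the index.
    with open(os.path.join(node_dir, "metadata.json")) as f:
        metadata = json.load(f)
    with open(os.path.join(node_dir, "translog.json")) as f:
        ops = json.load(f)
    index = {doc_id: doc for doc_id, doc in ops}
    return metadata, index

node_dir = os.path.join(tempfile.mkdtemp(), "node0")
write_node_state(node_dir,
                 {"indices": {"myindex": {"mappings": {}}}},
                 [["1", {"user": "per"}]])
metadata, index = recover(node_dir)
# Both the cluster metadata and the document are rebuilt purely from
# node-local storage, with nothing copied from a shared gateway.
```

This is also why the local gateway needs no separate copy of the data: the same node-local files serve both as the live storage and as the recovery source.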
E.g. is document-information persisted both "locally" on the nodes and
on the gateway, or only "locally" on the nodes?
This was a follow-up question to the prior question. No answer.
With the local gateway, it reuses the same local index storage. It does not
need to copy it around.
Is it persisted "locally" on all the nodes running a replica of the
shard containing the document, or only on the node running the primary?
Got my answer. Thanks.
Exactly when is information written to disk (locally or in the gateway)?
You say that the document gets indexed in a sync manner, but you don't
mention what operation it is synced with. I will assume that the indexing
and the writing to the local transaction log happen synchronously in the
"execute" method. It would have been nice if that had been stated clearly,
though - especially when the question was so clear about which operation
exactly does the actual indexing and whether or not it is done synchronously.
When you index a document, the call does not return until it has been
executed on all shards (sync replication). On each shard, it will index it
in Lucene, and add it to a transaction log.
Regards, Per Steffensen