In our project we're going to have several million documents
indexed by Elastic Search. Every day about 10% of all documents are
updated. The updated fields are numerical values, such as view counts,
that we're going to use for scoring.
I have some questions regarding this situation:
Is there a simpler way to update an index entry than fetching
the whole document via _source and then reindexing the modified version
with the same id?
How badly could index fragmentation hit me in this scenario?
Any recommendations on index options for frequently updated fields?
On Thu, Jul 8, 2010 at 4:42 PM, Mykhailo Korbakov rmihael@gmail.com wrote:
Hi everyone.
In our project we're going to have several million documents
indexed by Elastic Search. Every day about 10% of all documents are
updated. The updated fields are numerical values, such as view counts,
that we're going to use for scoring.
I have some questions regarding this situation:
Is there a simpler way to update an index entry than fetching
the whole document via _source and then reindexing the modified version
with the same id?
There is no way to do a partial update, so you need to fetch the document,
update it, and index it back.
How badly could index fragmentation hit me in this scenario?
There will be fragmentation, but it will slowly be merged out.
Any recommendations on index options for frequently updated fields?
Nothing special for this case; it's a perfectly valid case.
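The fetch, update, and index-back cycle described in the answer above can be sketched roughly as follows. This is a minimal sketch, not Elastic Search client code: `bump_counter`, `fetch_source`, and `index_document` are hypothetical names, and the two callables are assumed to wrap whatever client or HTTP calls you use to read a document's _source and to index a document under a given id.

```python
import copy


def bump_counter(fetch_source, index_document, doc_id, field, delta=1):
    """Fetch the whole document, change one numeric field, reindex it.

    fetch_source(doc_id) -> dict      (assumed: returns the stored _source)
    index_document(doc_id, source)    (assumed: indexes source under doc_id)
    """
    # Work on a copy so the caller's data only changes through reindexing.
    source = copy.deepcopy(fetch_source(doc_id))

    # Update the numeric field used for scoring, e.g. a view count.
    source[field] = source.get(field, 0) + delta

    # Index the full modified document back under the same id,
    # which replaces the previous version.
    index_document(doc_id, source)
    return source
```

With an in-memory dict standing in for the index, `bump_counter(store.__getitem__, store.__setitem__, "42", "views", 5)` reads document "42", adds 5 to its `views` field, and writes the whole document back under the same key.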
Thank you for answering, Shay.
Just to put my mind completely at ease: is there any way to monitor
fragmentation? Maybe I'll have to tune the merger somehow to reduce it,
etc.
There currently isn't an API that returns this value, but you can go to each
shard's storage directory and check the number of files, which reflects the
number of segments. There are parameters to control it (such as the
merge_factor), and there is an API to force "optimization".
After 0.9 I am going to provide a full set of APIs for index-level "info" and
"stats", similar to what the current version provides for nodes. This
information will be exposed there.
-shay.banon
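The manual check suggested above, counting the files in a shard's storage directory, could be scripted along these lines. This is only a sketch under assumptions: `shard_file_stats` is a hypothetical helper, you have to supply the shard directory path yourself (the on-disk layout is not described in this thread), and the segment estimate assumes Lucene's usual `_<name>.<ext>` file naming.

```python
import os
import re

# Assumption: per-segment Lucene files look like "_0.cfs", "_1.fdt", "_1_2.del".
SEGMENT_FILE = re.compile(r"^_([0-9a-z]+)[_.]")


def shard_file_stats(shard_dir):
    """Count files in one shard directory and estimate its segment count."""
    files = [
        name
        for name in os.listdir(shard_dir)
        if os.path.isfile(os.path.join(shard_dir, name))
    ]

    # Group per-segment files by the name that follows the leading underscore.
    segments = set()
    for name in files:
        match = SEGMENT_FILE.match(name)
        if match:
            segments.add(match.group(1))

    return {"files": len(files), "segments": len(segments)}
```

Running this periodically for each shard directory and watching whether the numbers keep growing gives a rough view of fragmentation until a proper stats API exists.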
On Thu, Jul 8, 2010 at 5:26 PM, Mykhailo Korbakov rmihael@gmail.com wrote:
Thank you for answering, Shay.
Just to put my mind completely at ease: is there any way to monitor
fragmentation? Maybe I'll have to tune the merger somehow to reduce it,
etc.