Hi,
We are using Elasticsearch (ES) to index a large number of documents every day.
By default, Elasticsearch (Lucene) creates small segments, and its background
threads merge them into bigger ones according to the merge policy.
To reduce merge cycles and produce efficient segments in the first place, I would
like ES to build bigger segments in memory and then write them to disk (no
problem if it uses more memory and searches see the documents only after the
segments are flushed to disk).
I know older versions of Lucene (3.0) had the setting setMaxBufferedDocs
(http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/index/IndexWriter.html#setMaxBufferedDocs(int)),
which determines the minimal number of documents required before the buffered
in-memory documents are flushed as a new segment.
Can I do the same in ES, i.e. create bigger segments (maybe 100 MB or larger)
and reduce merge cycles (at most one merge, or avoid merging entirely and run a
force merge at the end of the day, since I am creating a new index every day),
to reduce I/O substantially?
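Concretely, the end-of-day step I have in mind would be something like the
following sketch (the daily index name is hypothetical; `_optimize` is the
merge-forcing endpoint in ES versions of this era, later renamed `_forcemerge`):

```shell
# Once writes to yesterday's daily index have stopped, force-merge it
# down to a single segment so reads hit as few segments as possible.
curl -XPOST 'http://localhost:9200/logs-2013.08.02/_optimize?max_num_segments=1'
```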
Raising your refresh interval should help here. We flush every 3 sec by
default, which creates lots of segments. If you set it to -1, you can control
it yourself by calling flush or refresh via the API. You should also look at
indices.memory.index_buffer_size to control how much RAM is used for document
buffering. Note that Lucene 4 works differently and doesn't merge everything
in memory; you can use fewer threads and it will create fewer segments. For
throughput, though, it might actually be better to write more but smaller
segments.
simon
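A minimal sketch of the settings mentioned above, assuming a hypothetical daily
index named logs-2013.08.03 and a node on localhost:9200:

```shell
# Disable automatic refresh so new segments are created only when you ask.
curl -XPUT 'http://localhost:9200/logs-2013.08.03/_settings' -d '{
  "index": { "refresh_interval": "-1" }
}'

# Trigger a refresh yourself whenever recently indexed documents
# should become visible to search.
curl -XPOST 'http://localhost:9200/logs-2013.08.03/_refresh'
```

indices.memory.index_buffer_size is a node-level setting, so it goes in
elasticsearch.yml rather than through the index settings API, e.g.:
indices.memory.index_buffer_size: 30%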
On Saturday, August 3, 2013 3:59:30 AM UTC+2, Prakash Patidar wrote: