Elasticsearch/Lucene: is space from deleted documents reused or recovered?

I'm starting a project to index log files. I don't particularly want to
wait until the log files roll over. There will be files from hundreds of apps
running across hundreds of machines (not all apps intersect with all machines,
but you get the drift). Some roll over very fast; some may take days.

The problem is this: if I am constantly reindexing the same document
(same id), am I losing all the old space (store and/or index)? Or is
Elasticsearch/Lucene smart enough to treat it as a new version, overwriting
the old store/index entries, pointing to the existing ones where they are
unchanged, and adding the new ones?

Certainly, there is a more sophisticated model that treats every line as a
unique document/row such that this doesn't become an issue, but I'm not
ready to spend that kind of dev and hardware on this issue. (Our
Elasticsearch solution is wrapped in a system that becomes really
heavy-handed when indexing such small pieces.)

--Shannon Monasco

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9d9d38f7-ba4f-470c-9864-5b9af8abc773%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lucene will hold onto deleted documents until a merge is performed. An
update in Lucene is basically an atomic delete/insert.

An optimize will help reclaim the space used by deleted documents. Did you
change your merge settings? Deleted documents should eventually be removed
whenever new segments are created.
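
To make the delete/insert behaviour concrete, here is a toy Python model (not Lucene code; `Segment`, `ToyIndex` and their methods are made-up names for illustration). It assumes segments are write-once, that reindexing an id only marks the old copy as deleted, and that space comes back only when segments are merged:

```python
class Segment:
    """A write-once batch of documents plus a set of ids marked as deleted."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> content, never rewritten in place
        self.deleted = set()     # ids marked deleted; their bytes stay on disk

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

    def size(self):
        # Deleted documents still count towards the space used.
        return sum(len(d) for d in self.docs.values())


class ToyIndex:
    def __init__(self):
        self.segments = []

    def index(self, doc_id, content):
        # Update = delete + insert: mark any older copy as deleted,
        # then write the new version into a fresh segment.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)
        self.segments.append(Segment({doc_id: content}))

    def size(self):
        return sum(seg.size() for seg in self.segments)

    def merge_all(self):
        # A merge copies only live documents into a new segment,
        # which is the point where deleted space is reclaimed.
        live = {}
        for seg in self.segments:
            live.update(seg.live_docs())
        self.segments = [Segment(live)]


idx = ToyIndex()
idx.index("app.log", "line1\n")
idx.index("app.log", "line1\nline2\n")
idx.index("app.log", "line1\nline2\nline3\n")
print(idx.size())   # 36: all three versions still occupy space
idx.merge_all()
print(idx.size())   # 18: only the latest version survives the merge
```

The sizes are just character counts, so the numbers only illustrate the shape of the problem: each reindex of a growing log file leaves the previous copy on disk until a merge rewrites the segment.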

Cheers,

Ivan

I haven't changed my merge settings. How often should segments be created
and how often should merges happen naturally?

The default merge policy in Lucene (TieredMergePolicy) has a bias towards
segments with more deletes, so it is "trying" to merge those away. You can
increase this bias by raising index.merge.policy.reclaim_deletes_weight,
but be careful not to make it so high that awful merges get selected.
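
As a rough picture of what that bias means, the sketch below scores merge candidates with a deliberately simplified formula. It is an assumption made for illustration, not the actual TieredMergePolicy scoring; `toy_merge_score`, `pick_merge` and the `base_cost` numbers are hypothetical. Lower scores win, and a larger weight rewards merges that would drop more deleted bytes:

```python
def toy_merge_score(base_cost, bytes_before, bytes_after, reclaim_deletes_weight):
    """Simplified score for one candidate merge; lower is better.

    base_cost stands in for everything else a merge policy cares about
    (segment sizes, balance, and so on). bytes_after excludes deleted docs.
    """
    non_deleted_ratio = bytes_after / bytes_before
    return base_cost * non_deleted_ratio ** reclaim_deletes_weight


def pick_merge(candidates, reclaim_deletes_weight):
    # Choose the candidate with the lowest score.
    return min(
        candidates,
        key=lambda name: toy_merge_score(*candidates[name], reclaim_deletes_weight),
    )


# name -> (base_cost, bytes_before, bytes_after); values are invented.
candidates = {
    "balanced_small": (1.0, 1000, 980),   # cheap merge, few deletes to reclaim
    "delete_heavy": (1.5, 1000, 700),     # costlier merge, many deletes to reclaim
}

print(pick_merge(candidates, 0.0))   # deletes ignored: the cheaper merge wins
print(pick_merge(candidates, 2.0))   # deletes favoured: the delete-heavy merge wins
```

In this toy, pushing the weight ever higher would eventually select a candidate no matter how bad its `base_cost` is, which is the "awful merges" risk mentioned above.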

If you want to see the gory details, as of Elasticsearch 1.2 you can turn
on lucene.iw: TRACE in config/logging.yml to see when merges run, which
segments, how many deletes those segments had, etc.

Mike

http://blog.mikemccandless.com

Thanks guys!