Removing unused fields (more Lucene than ES but..)

OK, this is more low-level Lucene, but in the context of an Elasticsearch
cluster, is there any way to get an index/shard to optimize away a bunch of
fields that are no longer used (literally have no term values associated
with them)?

We had an application bug introduced that polluted an index with a very
large number of fields (25,000 fields... cough), and let's just say
things weren't well after that.

We've deleted all the rogue records, but the shards still contain the raw
Lucene field information (we've inspected these with Luke), and the cluster
is heavily CPU bound processing "refreshVersionTable" calls, which sit in a
large loop whose size is a function of the number of fields in the segments.

We've attempted a test optimize of the index using Luke on a single shard,
but the residual segments post-optimize still contain a large number of
these fields, all with no values associated with them.

Obviously a reindex would do this, but any other bright ideas that are
quicker than that (it's a 45 million item index we're trying to keep up)
would be most welcome!

We're on ES 0.19.10 still (Lucene 3.6.1). (You can tell me "upgrade"
another day please..)

Here's a snapshot from Luke of a single shard from this index.

cheers!

Paul Smith

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHfYWB5nO%3DDQ50SQ4kgde6JvT%3DgjQ_7FmLbVcXVk5Kiurwme%2Bg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

It is actually possible in Lucene 4, but there is nothing really
convenient set up to do this.

You have two choices there:

  1. trigger a massive merge (essentially an optimize), by wrapping all
    readers and calling IndexWriter.addIndexes(IndexReader...).
  2. wrap readers in a custom merge policy and do it slowly over time.

In both cases you'd use something like
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/index/FieldFilterAtomicReader.java

For Lucene 3, this would be more complicated. I don't think it's
impossible, but unfortunately there is no available code in this case.

Thanks Robert for the reply, all of that sounds fairly hairy. I did try a
full optimize of the shard index using Luke, but the residual über-segment
still has the field definitions in it. Are you saying in (1) that creating
a new shard index through a custom call to IndexWriter.addIndexes(..)
would produce a fully optimized index without the fields, and that this is
different from what an Optimize operation through ES would do? It's more a
technical question now about what the difference is between the Optimize
call and a manual create-new-index-from-multiple-readers. (I actually
thought that's what Optimize does in practical terms, but there's obviously
more or less going on under the hood in these different code paths.)

We're going the reindex route for now, was just hoping there was some
special trick we could do a little easier than the above. :)

thanks again for your time!

Paul

Optimize and normal merging don't "garbage collect" unused fields from
fieldinfos:

https://issues.apache.org/jira/browse/LUCENE-1761

The addIndexes trick is also a forced merge, but it decorates the
readers-to-be-merged, lying and hiding the fields as if they don't exist.
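
For anyone reading this later, the difference between the two code paths can be pictured with a toy model. This is only a sketch in plain Java and does not use any Lucene classes: `Segment`, `merge`, `hideFields` and `filteredMerge` are hypothetical names invented for illustration, and the field names are made up. A plain merge carries every registered field name forward, while merging through a decorated view that hides the rogue fields produces a result that never hears about them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldInfoMergeSketch {

    /** Toy stand-in for a segment: registered field names plus live values per field. */
    static final class Segment {
        final Set<String> fieldInfos = new LinkedHashSet<>();
        final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Registers a field name; values are optional (a rogue field has none left). */
        Segment field(String name, String... vals) {
            fieldInfos.add(name);
            if (vals.length > 0) {
                values.computeIfAbsent(name, k -> new ArrayList<>()).addAll(Arrays.asList(vals));
            }
            return this;
        }
    }

    /** Plain merge: every registered field name is carried over, even with no values. */
    static Segment merge(List<Segment> segments) {
        Segment out = new Segment();
        for (Segment s : segments) {
            out.fieldInfos.addAll(s.fieldInfos);
            for (Map.Entry<String, List<String>> e : s.values.entrySet()) {
                out.values.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return out;
    }

    /** Decorated view of a segment that pretends the hidden fields do not exist. */
    static Segment hideFields(Segment in, Set<String> hidden) {
        Segment view = new Segment();
        for (String name : in.fieldInfos) {
            if (hidden.contains(name)) {
                continue; // "lie" to the merge: this field is never reported
            }
            view.fieldInfos.add(name);
            if (in.values.containsKey(name)) {
                view.values.put(name, in.values.get(name));
            }
        }
        return view;
    }

    /** Merge through the decorated views, so hidden fields never reach the output. */
    static Segment filteredMerge(List<Segment> segments, Set<String> hidden) {
        List<Segment> views = new ArrayList<>();
        for (Segment s : segments) {
            views.add(hideFields(s, hidden));
        }
        return merge(views);
    }

    public static void main(String[] args) {
        Segment a = new Segment().field("title", "hello").field("rogue_1");
        Segment b = new Segment().field("title", "world").field("rogue_2");
        List<Segment> segments = Arrays.asList(a, b);
        Set<String> rogue = new LinkedHashSet<>(Arrays.asList("rogue_1", "rogue_2"));

        System.out.println(merge(segments).fieldInfos);                // [title, rogue_1, rogue_2]
        System.out.println(filteredMerge(segments, rogue).fieldInfos); // [title]
    }
}
```

In real Lucene 4 the role of `hideFields` would be played by something like the FieldFilterAtomicReader linked earlier in the thread, with the wrapped readers passed to IndexWriter.addIndexes; the toy model only shows why the decoration matters.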

Thanks for the JIRA link Robert, I've added a comment to it just to share
the real world aspect of what happened to us for background.

Thank you Paul, I added some comments just so the technical challenges
and risks are clear.

It's unfortunately not so easy to fix...

Yeah, I probably should have realised when I read that: if it was easy,
it probably would have already been done! :)

Paul
