Index size improvements in 0.90?

I finally finished a grueling upgrade of my local code from Lucene 3.6 to
4.3. I don't use elasticsearch for everything and still have a fair amount
of Lucene code. You name it, I have a custom class for it.

With the new Lucene jars in place, I was finally able to upgrade
elasticsearch from 0.20.0 to 0.90.1 (Lucene class conflicts being the
obstacle). So far my Lucene code has produced much smaller indices, which
I'm still testing. My elasticsearch-specific code has not changed (besides
fixing API changes), and neither has my configuration. I do some
pre-tokenization on the client side for various reasons, but elasticsearch
does the bulk of the analysis. The resulting test index is one third of the
original size:

size: 15.8gb (15.8gb)
docs: 8711039 (8711039)

size: 5.2gb (5.2gb)
docs: 8757039 (8757039)

I did disable timestamps (elasticsearch bug which I will fix), but
everything else is the same. A two-thirds reduction scares me a bit. Has
anyone seen such a dramatic reduction in index size?
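
To put the two listings above into numbers, here is a minimal back-of-the-envelope sketch. It uses only the sizes and document counts quoted above; the class and method names are made up for illustration, and it assumes that "gb" in the status output means 2^30 bytes.

```java
// Rough arithmetic on the two index status listings quoted above.
// Assumption: "gb" means 2^30 bytes; all names here are illustrative only.
final class IndexSizeComparison {

    /** Percentage by which the size shrank going from `before` to `after`. */
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    /** Average on-disk bytes per document for an index size given in GiB. */
    static double bytesPerDoc(double sizeGb, long docs) {
        return sizeGb * 1024 * 1024 * 1024 / docs;
    }

    public static void main(String[] args) {
        double oldGb = 15.8;
        long oldDocs = 8_711_039L;
        double newGb = 5.2;
        long newDocs = 8_757_039L;

        System.out.printf("size reduction:   %.1f%%%n", reductionPercent(oldGb, newGb));
        System.out.printf("bytes/doc before: %.0f%n", bytesPerDoc(oldGb, oldDocs));
        System.out.printf("bytes/doc after:  %.0f%n", bytesPerDoc(newGb, newDocs));
    }
}
```

The new index holds slightly more documents than the old one, so the per-document saving is marginally larger than the raw size ratio suggests.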

Cheers,

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Ivan!

The vast majority of the code changes I needed to make had to do with the
getters. For example, the (cleaner, IMHO) tokens method was gone and I had
to call the getTokens method instead.

The truly vexing change had to do with facets, and once I figured it out
the changes were rather simple. But until then, I almost wore out Google!
For example:

First a change to the import:

-import org.elasticsearch.search.facet.AbstractFacetBuilder;

+import org.elasticsearch.search.facet.FacetBuilder;

And a change to the object class returned:

-  public AbstractFacetBuilder getFacetRequest();
+  public FacetBuilder getFacetRequest();

Once that change had rippled through the abstract base class and my three
derived classes (including one that implements a true multi-field
hierarchy!), everything worked fine.

And yes, the indices were one-half to one-third the size when rebuilt. I
remembered something about compression being the default in 0.90.0, and I
never added compression options to my 0.20.4 indices.

Brian

In reply to Ivan Brusic's message of Friday, June 14, 2013 7:14:27 PM UTC-4.

After your post I ran a test with the Play Framework and the
play2-elasticsearch plugin, plus one remote elasticsearch node for each
version. With that setup you can switch between 0.20.5 and 0.90.1 in seconds.

I tested with a class that contains only a long string. And yes, the newer
Lucene index is much smaller: 90M against 900M. The search results are the
same. My ES mapping is trivial:

    {
      "indexTest": {
        "properties": {
          "name": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "null_value": "na"
          }
        }
      }
    }

I hope I didn't miss anything, but I don't think so.

In reply to Ivan Brusic's message of Saturday, June 15, 2013 01:14:27 UTC+2.

I think the following link is interesting and explains a lot:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
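
As a rough illustration of why compressing stored fields in blocks helps so much with small, similar documents, here is a minimal sketch. It does not use Lucene at all: it uses the JDK's java.util.zip.Deflater as a stand-in compressor (not the codec Lucene actually uses), and the sample documents and all names are made up. It only compares compressing each document on its own against compressing many documents together as one block.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: DEFLATE from the JDK stands in for Lucene's real codec,
// and the documents are invented sample data.
final class BlockCompressionSketch {

    /** Returns the DEFLATE-compressed length of `input` in bytes. */
    static int compressedLength(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds `count` small documents that share most of their structure. */
    static String[] sampleDocs(int count) {
        String[] docs = new String[count];
        for (int i = 0; i < count; i++) {
            docs[i] = "{\"name\":\"user" + i + "\",\"status\":\"active\",\"country\":\"US\"}";
        }
        return docs;
    }

    /** Total size when every document is compressed on its own. */
    static int perDocumentSize(String[] docs) {
        int total = 0;
        for (String doc : docs) {
            total += compressedLength(doc.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Size when all documents are concatenated and compressed as one block. */
    static int singleBlockSize(String[] docs) {
        StringBuilder block = new StringBuilder();
        for (String doc : docs) {
            block.append(doc);
        }
        return compressedLength(block.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String[] docs = sampleDocs(1000);
        System.out.println("compressed one by one: " + perDocumentSize(docs) + " bytes");
        System.out.println("compressed as a block: " + singleBlockSize(docs) + " bytes");
    }
}
```

A tiny document gives the compressor almost no redundancy to exploit, whereas a block of similar documents repeats the same field names and values many times. The exact ratio depends on the data and the compressor, so treat the printed numbers as illustrative only.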


Hi,

Indeed, Lucene 4.3 has much smaller indices, especially if you have small
or easily-compressible documents (cthoma's link) and if you have highly
frequent terms[1].

[1] Changing Bits: Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

--
Adrien Grand


I am on the Lucene mailing list (I rarely post though) and subscribe to
Mike's and Adrien's blog feeds, but nowhere have I seen comments about how
dramatic the reduction in index size can actually be.

I have been running without stored fields and with compressed source for a
while, so I assumed the new compression scheme used in Lucene 4 would not
offer much savings. Then again, I was not really after a smaller index
(although it helps with the IO cache). The actual reason for the
elasticsearch upgrade is better cache management, since I have
encountered a huge explosion of the field cache with the introduction of
nested documents.

A rough test with these new indices showed little difference in QPS, but
the number of threads and the number of GCs went down dramatically. What I
really wanted to monitor was the field cache usage, but that stat moved in
0.90 and I was not able to find it.

Great job by the Lucene and elasticsearch teams.

Cheers,

Ivan

On Sun, Jun 16, 2013 at 4:31 AM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

Hi,

Indeed, Lucene 4.3 has much smaller indices, especially if you have small
or easily-compressible documents (cthoma's link) and if you have highly
frequent terms[1].

[1]
Changing Bits: Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

--
Adrien Grand

