Compression options for small documents


(dottom) #1

Would it be possible to eventually compress data beyond _source compression? I am indexing a lot of small documents, 200-400 bytes in size.

Using default settings, the index is 4-6x the size of the raw text data (the range depends on which fields I add). Turning on _source compression yields only a small size reduction because the original documents are small. Disabling _source storage altogether saves 1x the space (with added cost at query time to pull individual records), which gets me down to an index size of 3-5x the original raw data.

I have tested no_analyzer settings, set term_vectors=no (the default), omit_norms=true (turned off boosting), omit_terms_freq_and_positions=true, and store=no (the default). With such small documents, these parameters don't yield much index size reduction, less than 10% in my case. My next test would be to define a custom analyzer and token filters specific to my data, but I don't anticipate a significant reduction there either.

Is there a way to achieve compression across documents rather than within individual documents (such as http://www.nearinfinity.com/blogs/aaron_mccurry/lucene_compression.html)? Or any way to compress the indices themselves, so we could actually see a reduction in total disk used vs. raw data indexed (see http://code.google.com/p/lucenetransform/)?


(Shay Banon) #2

On Sat, Jul 23, 2011 at 12:00 PM, dottom dottom@gmail.com wrote:

Is there a way to achieve compression beyond individual documents? Or any
way to compress the indices themselves so we could actually see a
reduction in total disk used vs. raw data indexed?

No.


(Eks Dev) #3

Is your collection relatively static?
The best you can do is to use static Huffman and/or static dictionary
compression (eventually with escaping for new symbols).

You could try to compress each document yourself into one binary field,
but to achieve reasonable performance and compression on such small
documents you have to use some form of static compression.
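One concrete form of static compression that works on inputs this small is deflate with a preset dictionary, which gives the compressor shared history to match against even when a single document is only a few hundred bytes. A minimal Python sketch (the dictionary contents and field names here are made up; in practice the dictionary would be built from strings that recur across your documents):

```python
import zlib

# Hypothetical shared dictionary of substrings common to many documents.
# The same bytes must be used for compression and decompression.
ZDICT = b'"status":"active","country":"US","type":"login","user_id":'

def compress_doc(raw: bytes) -> bytes:
    # The preset dictionary lets deflate find long matches in a 200-400
    # byte input, where a normal stream has no prior history to refer to.
    c = zlib.compressobj(level=9, zdict=ZDICT)
    return c.compress(raw) + c.flush()

def decompress_doc(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=ZDICT)
    return d.decompress(blob) + d.flush()

doc = b'{"status":"active","country":"US","type":"login","user_id":12345}'
blob = compress_doc(doc)
assert decompress_doc(blob) == doc
```

The compressed blob could then be stored in a binary field, as suggested above, with compression/decompression done entirely in client code.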

For example, if you have a field containing only numbers, you could
build a static Huffman code to compress it by at least 50%, since you
can store 10 symbols in 4 bits; or if you have a field with low
cardinality (e.g. zip code), you could use simple dictionary compression...
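The "10 symbols in 4 bits" idea for digits-only fields can be sketched as simple nibble packing (this is an illustration of the idea, not anything ES provides):

```python
# Pack a digits-only string two symbols per byte: values 0-9 encode
# digits, and 0xF marks the end when the digit count is odd.
def pack_digits(s: str) -> bytes:
    nibbles = [int(ch) for ch in s]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # padding sentinel
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

def unpack_digits(b: bytes) -> str:
    out = []
    for byte in b:
        for nib in (byte >> 4, byte & 0xF):
            if nib == 0xF:
                return "".join(out)
            out.append(str(nib))
    return "".join(out)

assert unpack_digits(pack_digits("19104")) == "19104"
```

A 10-digit value packs into 5 bytes, the 50% reduction mentioned above; a true static Huffman code over the field's symbol frequencies could do somewhat better.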

I think ES supports binary fields, but I do not know if there is any
infrastructure to plug something like that into the server (you can do
it in client code). I did it before on a raw Lucene stored field and
reduced the total size to some 40% of the original.



(dottom) #4
Is your collection relatively static? The best you can do is to use static Huffman and/or static dictionary compression (eventually with escaping for new symbols)

There are some string values that appear repeatedly, and I've looked at tokenizing strings prior to indexing, where the app layer would handle tokenizing user queries and untokenizing search results. The challenge is that for tokenization to be effective on our data, we have to tokenize whole phrases, at which point we lose some phrase-match granularity, since ES only sees the tokens and not the words within them.
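The app-layer scheme described above amounts to a substitution dictionary applied symmetrically on the way in and the way out. A minimal sketch (the phrase table and token format are hypothetical):

```python
# Hypothetical phrase table; in practice it would be built from the
# string values that recur across our documents.
PHRASES = {
    "connection timed out": "_p0_",
    "permission denied": "_p1_",
}
REVERSE = {tok: phrase for phrase, tok in PHRASES.items()}

def shorten(text: str) -> str:
    # Applied to documents before indexing, and to user queries.
    for phrase, tok in PHRASES.items():
        text = text.replace(phrase, tok)
    return text

def expand(text: str) -> str:
    # Applied to search results coming back from ES.
    for tok, phrase in REVERSE.items():
        text = text.replace(tok, phrase)
    return text
```

The granularity loss is visible here: once "connection timed out" becomes the single token "_p0_", a query for just "timed" no longer matches those documents.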

you could try to compress each document yourself into one binary field, but to achieve reasonable performance and compression on such small documents you have to use some form of static compression

I think the underlying challenge is that the raw data and the associated terms/analyzed values/etc. are stored uncompressed in ES, except for _source, which we can compress or disable storing altogether (and retrieve each result individually). So even in the best case we are probably looking at a 2-3x increase in disk capacity vs. the original raw data. The problem would be somewhat simpler if we were indexing large documents, such as web pages or office documents.

for example, if you have a field containing only numbers, you could build a static Huffman code to compress it by at least 50%, since you can store 10 symbols in 4 bits; or if you have a field with low cardinality (e.g. zip code), you could use simple dictionary compression...

That's a good idea, and we can do some shortening with things like IP addresses, zip codes, credit card numbers, etc., but in the end it only saves a little disk space and also prevents us from performing range searches on the data.

I think ES supports binary fields, but I do not know if there is any infrastructure to plug something like that into the server (you can do it in client code). I did it before on a raw Lucene stored field and reduced the total size to some 40% of the original

We are looking at client-side grouping of documents to 1) reduce the number of terms that are stored, and 2) take advantage of _source compression. The idea is to take, say, 1000 small documents (200-400 bytes each) and insert them as a single document. At search time, we would have to process the resulting data client-side. There are some drawbacks, the main one being that Lucene query syntax is applied to the group of documents, so complex queries require us to do a lot of client-side processing. For example, we won't be able to provide an accurate count of matches, because we have to take the results returned by ES and process each one individually. But for some limited search applications this might work.
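A rough sketch of that grouping scheme, assuming a made-up record shape with a single "text" field (field names and group size are illustrative, not anything ES prescribes):

```python
# Bundle N small records into one ES document, then re-filter the
# bundle client-side after a search.
GROUP_SIZE = 1000  # records per stored document (assumption)

def make_group_doc(records):
    # One concatenated blob for indexing, plus the raw records so the
    # group's _source (compressed) can reconstruct each original record.
    return {
        "text": "\n".join(r["text"] for r in records),
        "records": records,
    }

def matches_in_group(group_doc, query_terms):
    # ES can only report that the *group* matched; to recover the
    # individual hits (and an accurate count) each record in every
    # returned group must be re-checked here.
    return [r for r in group_doc["records"]
            if all(t in r["text"] for t in query_terms)]
```

This also shows the false-positive problem mentioned above: a group can match a multi-term query because different records each contain one term, so the client-side re-check is mandatory, not optional.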

