Elasticsearch index MUCH larger than similar Lucene index

shlomivaknin · May 30, 2013, 10:20am

I copied the wrong line before...

ES was actually:

curl -XPOST 'http://host:9200/test/_optimize?max_num_segments=1'
{"ok":true,"_shards":{"total":1,"successful":1,"failed":0}}
Just so I don't throw you off in the wrong direction...

On Thursday, May 30, 2013 1:02:25 PM UTC+3, Shlomi wrote:

Israel, sorry for any inconvenience my thread has caused you.

Now, back to the really annoying results.

ES version:

curl -XPOST 'http://host:9200/test/_optimize?max_num_segments=1'
{"ok":true,"_shards":{"total":0,"successful":0,"failed":0}}

ls -ltra
total 16623176
drwxr-xr-x 5 elasticsearch elasticsearch 4096 May 27 17:34 ..
-rw-r--r-- 1 elasticsearch elasticsearch 31 May 27 18:16 _1s0.fnm
-rw-r--r-- 1 elasticsearch elasticsearch 240390660 May 27 18:16 _1s0.fdx
-rw-r--r-- 1 elasticsearch elasticsearch 2178157235 May 27 18:16 _1s0.fdt
-rw-r--r-- 1 elasticsearch elasticsearch 742546522 May 27 18:17 _1s0.tis
-rw-r--r-- 1 elasticsearch elasticsearch 7152131 May 27 18:17 _1s0.tii
-rw-r--r-- 1 elasticsearch elasticsearch 440466009 May 27 18:17 _1s0.prx
-rw-r--r-- 1 elasticsearch elasticsearch 1017914310 May 27 18:17 _1s0.frq
-rw-r--r-- 1 elasticsearch elasticsearch 30048836 May 27 18:17 _1s0.nrm
-rw-r--r-- 1 elasticsearch elasticsearch 31 May 27 18:38 _2oj.fnm
-rw-r--r-- 1 elasticsearch elasticsearch 2149916547 May 27 18:38 _2oj.fdt
-rw-r--r-- 1 elasticsearch elasticsearch 238283772 May 27 18:38 _2oj.fdx
-rw-r--r-- 1 elasticsearch elasticsearch 735613612 May 27 18:39 _2oj.tis
-rw-r--r-- 1 elasticsearch elasticsearch 7082393 May 27 18:39 _2oj.tii
-rw-r--r-- 1 elasticsearch elasticsearch 434339734 May 27 18:39 _2oj.prx
-rw-r--r-- 1 elasticsearch elasticsearch 1005557319 May 27 18:39 _2oj.frq
-rw-r--r-- 1 elasticsearch elasticsearch 29785475 May 27 18:39 _2oj.nrm
-rw-r--r-- 1 elasticsearch elasticsearch 0 May 30 11:49 write.lock
-rw-r--r-- 1 elasticsearch elasticsearch 31 May 30 11:50 _37c.fnm
-rw-r--r-- 1 elasticsearch elasticsearch 402061692 May 30 11:50 _37c.fdx
-rw-r--r-- 1 elasticsearch elasticsearch 3636925770 May 30 11:50 _37c.fdt
-rw-r--r-- 1 elasticsearch elasticsearch 1229530031 May 30 11:52 _37c.tis
-rw-r--r-- 1 elasticsearch elasticsearch 11770457 May 30 11:52 _37c.tii
-rw-r--r-- 1 elasticsearch elasticsearch 735561692 May 30 11:52 _37c.prx
-rw-r--r-- 1 elasticsearch elasticsearch 1698617265 May 30 11:52 _37c.frq
-rw-r--r-- 1 elasticsearch elasticsearch 50257715 May 30 11:53 _37c.nrm
-rw-r--r-- 1 elasticsearch elasticsearch 828 May 30 11:53 segments_4n
-rw-r--r-- 1 elasticsearch elasticsearch 20 May 30 11:53 segments.gen
-rw-r--r-- 1 elasticsearch elasticsearch 138 May 30 11:53 _checksums-1369903982814
drwxr-xr-x 2 elasticsearch elasticsearch 20480 May 30 11:53 .

Java version, after optimizing to the normal (non-compound) file format with max_segments=1:

ls -ltr

total 8759876
-rw-rw-r-- 1 shlomiv shlomiv 24 May 30 11:34 _ao.fnm
-rw-rw-r-- 1 shlomiv shlomiv 883151756 May 30 11:35 _ao.fdx
-rw-rw-r-- 1 shlomiv shlomiv 4343895906 May 30 11:35 _ao.fdt
-rw-rw-r-- 1 shlomiv shlomiv 14132289 May 30 11:36 _ao.tis
-rw-rw-r-- 1 shlomiv shlomiv 197431 May 30 11:36 _ao.tii
-rw-rw-r-- 1 shlomiv shlomiv 506303552 May 30 11:36 _ao.prx
-rw-rw-r-- 1 shlomiv shlomiv 3111989398 May 30 11:36 _ao.frq
-rw-rw-r-- 1 shlomiv shlomiv 110393973 May 30 11:36 _ao.nrm
-rw-rw-r-- 1 shlomiv shlomiv 285 May 30 11:36 segments_38
-rw-rw-r-- 1 shlomiv shlomiv 20 May 30 11:36 segments.gen

Still twice the size after optimization. Israel is right, this saga is
really annoying.

Thanks a lot for your patience. I think it's important to understand the
cause of this size increase, and not just for my sake.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jpountz · May 30, 2013, 10:26am

Your Elasticsearch index seems to have several segments (for example, an
optimized index would have only one .fdx file), are you sure you listed the
correct directory?

--
Adrien Grand

shlomivaknin · May 30, 2013, 10:58am

I ran that again:

curl -XPOST 'http://host:9200/test/_optimize?max_num_segments=1'
{"ok":true,"_shards":{"total":1,"successful":1,"failed":0}}

and got the same listing.

So I ran optimize with Luke, and that gave me a cleaner listing, but still
of the same size:

ls -ltr
total 16586872
-rw-r--r-- 1 elasticsearch elasticsearch 31 May 30 13:44 _37d.fnm
-rw-r--r-- 1 elasticsearch elasticsearch 880736116 May 30 13:46 _37d.fdx
-rw-r--r-- 1 elasticsearch elasticsearch 7964999544 May 30 13:46 _37d.fdt
-rw-r--r-- 1 elasticsearch elasticsearch 2665101249 May 30 13:50 _37d.tis
-rw-r--r-- 1 elasticsearch elasticsearch 25370865 May 30 13:50 _37d.tii
-rw-r--r-- 1 elasticsearch elasticsearch 1610367435 May 30 13:50 _37d.prx
-rw-r--r-- 1 elasticsearch elasticsearch 3728236673 May 30 13:50 _37d.frq
-rw-r--r-- 1 elasticsearch elasticsearch 110092018 May 30 13:50 _37d.nrm
-rw-r--r-- 1 elasticsearch elasticsearch 313 May 30 13:50 segments_4o
-rw-r--r-- 1 elasticsearch elasticsearch 20 May 30 13:50 segments.gen
-rw-r--r-- 1 elasticsearch elasticsearch 270 May 30 13:50 _checksums-1369911034916

thanks

jpountz · May 30, 2013, 11:34am

OK, so the larger files are

fdt: stored fields
tis, tii: terms dictionary
prx: positions

So a few ideas:

Did you use the same analyzers?
Did you use the same index options?
Did you mark all your fields stored in your mapping? This would explain
why the fdt file is almost exactly 2x larger.
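
To put numbers on the comparison above, here is a minimal Java sketch that parses two `ls -l` listings and prints the ES/Lucene size ratio per file extension. The class and method names (`SegmentSizeRatio`, `sizesByExtension`, `ratio`) are made up for illustration and are not part of Lucene or Elasticsearch; the byte counts are copied from the optimized Lucene listing (`_ao.*`) and the Luke-optimized ES listing (`_37d.*`) earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (not part of Lucene or ES): compares segment file sizes by extension. */
final class SegmentSizeRatio {

    /** Parses `ls -l` output into extension -> total bytes; anything that is not a regular file line is skipped. */
    static Map<String, Long> sizesByExtension(String listing) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (String line : listing.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            // Expected columns: mode, links, owner, group, size, month, day, time, name
            if (cols.length != 9 || !cols[0].startsWith("-")) {
                continue;
            }
            String name = cols[8];
            int dot = name.lastIndexOf('.');
            if (dot <= 0 || dot == name.length() - 1) {
                continue; // no usable extension, e.g. "segments_4o"
            }
            sizes.merge(name.substring(dot + 1), Long.parseLong(cols[4]), Long::sum);
        }
        return sizes;
    }

    /** Size of the given extension in the ES listing divided by its size in the Lucene listing. */
    static double ratio(Map<String, Long> es, Map<String, Long> lucene, String ext) {
        return es.get(ext).doubleValue() / lucene.get(ext);
    }

    public static void main(String[] args) {
        String lucene = String.join("\n",
            "-rw-rw-r-- 1 shlomiv shlomiv 883151756 May 30 11:35 _ao.fdx",
            "-rw-rw-r-- 1 shlomiv shlomiv 4343895906 May 30 11:35 _ao.fdt",
            "-rw-rw-r-- 1 shlomiv shlomiv 14132289 May 30 11:36 _ao.tis",
            "-rw-rw-r-- 1 shlomiv shlomiv 197431 May 30 11:36 _ao.tii",
            "-rw-rw-r-- 1 shlomiv shlomiv 506303552 May 30 11:36 _ao.prx",
            "-rw-rw-r-- 1 shlomiv shlomiv 3111989398 May 30 11:36 _ao.frq",
            "-rw-rw-r-- 1 shlomiv shlomiv 110393973 May 30 11:36 _ao.nrm");
        String es = String.join("\n",
            "-rw-r--r-- 1 elasticsearch elasticsearch 880736116 May 30 13:46 _37d.fdx",
            "-rw-r--r-- 1 elasticsearch elasticsearch 7964999544 May 30 13:46 _37d.fdt",
            "-rw-r--r-- 1 elasticsearch elasticsearch 2665101249 May 30 13:50 _37d.tis",
            "-rw-r--r-- 1 elasticsearch elasticsearch 25370865 May 30 13:50 _37d.tii",
            "-rw-r--r-- 1 elasticsearch elasticsearch 1610367435 May 30 13:50 _37d.prx",
            "-rw-r--r-- 1 elasticsearch elasticsearch 3728236673 May 30 13:50 _37d.frq",
            "-rw-r--r-- 1 elasticsearch elasticsearch 110092018 May 30 13:50 _37d.nrm");

        Map<String, Long> luceneSizes = sizesByExtension(lucene);
        Map<String, Long> esSizes = sizesByExtension(es);
        for (String ext : luceneSizes.keySet()) {
            System.out.printf("%-4s ES/Lucene = %.2fx%n", ext, ratio(esSizes, luceneSizes, ext));
        }
    }
}
```

With these listings, .fdt comes out at roughly 1.8x, while .tis and .tii are more than 100x larger on the ES side, which suggests the terms dictionary deserves at least as much attention as the stored fields.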

--
Adrien Grand

shlomivaknin · May 30, 2013, 3:03pm

Hey

Did you use the same analyzers?

I think so. In the Java code I use WhitespaceTokenizer + LowerCaseFilter +
FilteringTokenFilter:

    WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer(Version.LUCENE_35, reader);
    LowerCaseFilter lowerCaseFilter = new LowerCaseFilter(Version.LUCENE_35, whitespaceTokenizer);
    return new PunctuationFilter(false, lowerCaseFilter);

where

    public class PunctuationFilter extends FilteringTokenFilter {....}

The first two should be equal to ES's "lowercase" tokenizer
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer/),
right? And our implementation of FilteringTokenFilter excludes exactly the
same tokens as listed in our stop words filter:

    "analyzer": {
        "ngram-index": {
            "tokenizer": "lowercase",
            "filter": [
                "myStop"
            ],
            "type": "custom"
        }
    }

("myStop" contains exactly the same list as our custom
FilteringTokenFilter. I know, because I copy-pasted it myself.)
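
Whether the two chains really tokenize alike is worth checking on a sample string. As far as I know, the `lowercase` tokenizer corresponds to Lucene's LowerCaseTokenizer, which splits at every non-letter, while WhitespaceTokenizer splits only at whitespace. The sketch below approximates both behaviors with plain Java string handling; it does not call the real Lucene classes, it ignores the PunctuationFilter and stop-word steps, and `TokenizerComparison` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java approximation of the two tokenization styles; not the real Lucene analyzers. */
final class TokenizerComparison {

    /** Approximates WhitespaceTokenizer + LowerCaseFilter: split on whitespace, then lowercase. */
    static List<String> whitespaceThenLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Approximates a letter-based lowercase tokenizer: every maximal run of letters is one token. */
    static List<String> lettersThenLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Don't stop-me 42 Now";
        System.out.println("whitespace: " + whitespaceThenLowercase(sample));
        System.out.println("letters:    " + lettersThenLowercase(sample));
    }
}
```

On the sample string the whitespace variant keeps `don't`, `stop-me` and `42` as single tokens, while the letter variant breaks them apart and drops the digits. If the real analyzers differ in the same way, the two indexes hold different sets of terms, so their terms dictionaries are not directly comparable.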

Did you use the same index options?

I don't exactly understand what you mean by index options...

Did you mark all your fields stored in your mapping? This would explain
why the fdt file is almost exactly 2x larger.

Well, my fields are stored, but my _source and _all are not (I figured I
would rather have the long stored as a long than have the original JSON
saved, to save space).
Other attempts with store="no" and _source enabled gave me pretty similar
results (I think maybe a few GB larger when using _source).

jpountz · May 30, 2013, 3:18pm

Hi,

On Thu, May 30, 2013 at 5:03 PM, Shlomi Vaknin <shlomivaknin@gmail.com> wrote:

I dont exactly understand what options you mean by index options...

Index options are a way to tell Lucene whether positions and offsets should
be indexed for a given field.
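
As a rough illustration of what each option keeps: Lucene 3.6 expresses this choice through `FieldInfo.IndexOptions`, and the Java snippet later in this thread sets `DOCS_ONLY`. The enum below is only a stand-in that mirrors those three constant names so the example runs without Lucene on the classpath; the flag and method names are invented here. Offsets in postings arrive with Lucene 4, as far as I know, so they are left out.

```java
/** Prints which postings data each index option keeps. */
final class IndexOptionsDemo {

    /** Stand-in mirroring the constant names of Lucene 3.6's FieldInfo.IndexOptions. */
    enum IndexOptions {
        DOCS_ONLY(false, false),
        DOCS_AND_FREQS(true, false),
        DOCS_AND_FREQS_AND_POSITIONS(true, true);

        final boolean storesFreqs;
        final boolean storesPositions;

        IndexOptions(boolean storesFreqs, boolean storesPositions) {
            this.storesFreqs = storesFreqs;
            this.storesPositions = storesPositions;
        }

        /** Postings files that hold data for a field indexed with this option. */
        String postingsFiles() {
            // .frq carries doc IDs (and term frequencies when kept); .prx carries positions.
            return storesPositions ? ".frq, .prx" : ".frq";
        }
    }

    public static void main(String[] args) {
        for (IndexOptions option : IndexOptions.values()) {
            System.out.printf("%-28s freqs=%-5b positions=%-5b files=%s%n",
                option, option.storesFreqs, option.storesPositions, option.postingsFiles());
        }
    }
}
```

If every field on one side is indexed with the docs-only option and the other side keeps the default, the .prx and .frq files are where the difference should show up.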

Could you reindex 1% of your data and upload your Lucene and Elasticsearch
indexes somewhere? If you can, I'd be happy to have a deeper look at them
to better understand what happens.

--
Adrien Grand

shlomivaknin · May 30, 2013, 3:33pm

Hey,

Thanks, I found index options; checking that out now.

About uploading a portion of my data, I'll have to get back to you on Monday
for this one. I need to ask first.

Thanks!

kimchy · June 1, 2013, 12:46am

Hi, I would add: make sure that the mappings you think you are setting are actually set, by using the get mapping API on the live ES node.

Also, I would try to remove all variables: simply run it with a simple Lucene program that does not use any custom analyzers or similarity, and do the same with ES. If you still see a difference, it will be much simpler to help.

Last, it is simpler to run optimize in the Lucene code down to a single segment, and do the same with ES (call optimize with max num segments set to 1).

Really last, don't index that much. You can index 100 MB and you should still see the difference.

shlomivaknin · June 6, 2013, 2:20pm

Hi,

I took some time off from this subject, and now I am back.

I took Shay's advice and removed all the special stuff. I now use the
StandardAnalyzer with no special similarity and no compound file, and I use
LogByteSizeMergePolicy.
In Elasticsearch, I used head to check the index. Here is what I got:
    {
        state: open
        settings: {
            index.number_of_replicas: 0
            index.number_of_shards: 1
            index.version.created: 200699
        }
        mappings: {
            test: {
                _source: {
                    enabled: false
                }
                properties: {
                    freq: {
                        store: yes
                        type: long
                    }
                    gram: {
                        store: yes
                        type: string
                    }
                }
                _all: {
                    enabled: false
                }
            }
        }
        aliases: [ ]
    }

So I guess the types are as I set them.

You know what, I won't guess. Here, I used the mapping API to get it:
curl -XGET 'http://es:9200/test/_mapping'

{"test":{"test":{"_all":{"enabled":false},"_source":{"enabled":false},"properties":{"freq":{"type":"long","store":"yes"},"gram":{"type":"string","store":"yes"}}}}}

OK, that looks the same. I hope I am not missing anything here.

I didn't specify any special mappings, analyzers, custom similarity or stop
words. Everything is standard.

As suggested, I ran just a small sample, with a final
IndexWriter.optimize(1) in Java and the relevant curl call on ES.

Here are the results.

Java:

ls -ltra

total 329904
drwxrwxrwt 35 root root 20480 Jun 6 17:00 ..
-rw-rw-r-- 1 shlomiv shlomiv 24 Jun 6 17:01 _d.fnm
-rw-rw-r-- 1 shlomiv shlomiv 36640292 Jun 6 17:01 _d.fdx
-rw-rw-r-- 1 shlomiv shlomiv 165896465 Jun 6 17:01 _d.fdt
-rw-rw-r-- 1 shlomiv shlomiv 3341766 Jun 6 17:02 _d.tis
-rw-rw-r-- 1 shlomiv shlomiv 45880 Jun 6 17:02 _d.tii
-rw-rw-r-- 1 shlomiv shlomiv 11335166 Jun 6 17:02 _d.prx
-rw-rw-r-- 1 shlomiv shlomiv 115927867 Jun 6 17:02 _d.frq
-rw-rw-r-- 1 shlomiv shlomiv 4580040 Jun 6 17:02 _d.nrm
drwxrwxr-x 2 shlomiv shlomiv 4096 Jun 6 17:07 .

and ES:

ls -ltr
total 657340
-rw-r--r-- 1 elasticsearch elasticsearch 31 Jun 6 16:06 _4l.fnm
-rw-r--r-- 1 elasticsearch elasticsearch 36527308 Jun 6 16:06 _4l.fdx
-rw-r--r-- 1 elasticsearch elasticsearch 316074621 Jun 6 16:06 _4l.fdt
-rw-r--r-- 1 elasticsearch elasticsearch 117176603 Jun 6 16:07 _4l.tis
-rw-r--r-- 1 elasticsearch elasticsearch 1120316 Jun 6 16:07 _4l.tii
-rw-r--r-- 1 elasticsearch elasticsearch 56962669 Jun 6 16:07 _4l.prx
-rw-r--r-- 1 elasticsearch elasticsearch 140660806 Jun 6 16:07 _4l.frq
-rw-r--r-- 1 elasticsearch elasticsearch 4565917 Jun 6 16:07 _4l.nrm
-rw-r--r-- 1 elasticsearch elasticsearch 313 Jun 6 16:07 segments_b
-rw-r--r-- 1 elasticsearch elasticsearch 20 Jun 6 16:07 segments.gen
-rw-r--r-- 1 elasticsearch elasticsearch 1994 Jun 6 16:07 _checksums-1370524025378
-rw-r--r-- 1 elasticsearch elasticsearch 0 Jun 6 17:00 write.lock
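
One way to read the two listings above more closely: in the Lucene 3.x stored-fields format, as far as I know, the .fdx file is a 4-byte header followed by one 8-byte pointer per document, so its size gives the document count, and the .fdt size divided by that count gives the stored bytes per document. Both .fdx sizes in the listings fit that layout exactly, but treat the layout as an assumption rather than a guarantee. `StoredFieldStats` and its methods are hypothetical names used only for this sketch.

```java
/** Estimates document count and stored bytes per document from .fdx/.fdt sizes (assumed Lucene 3.x layout). */
final class StoredFieldStats {

    // Assumption: .fdx = 4-byte format header + one 8-byte pointer per document.
    static final long FDX_HEADER_BYTES = 4;
    static final long FDX_POINTER_BYTES = 8;

    /** Number of documents implied by the size of the .fdx file. */
    static long docCount(long fdxBytes) {
        long body = fdxBytes - FDX_HEADER_BYTES;
        if (body < 0 || body % FDX_POINTER_BYTES != 0) {
            throw new IllegalArgumentException("size does not match the assumed .fdx layout: " + fdxBytes);
        }
        return body / FDX_POINTER_BYTES;
    }

    /** Average number of .fdt bytes per document. */
    static double storedBytesPerDoc(long fdtBytes, long fdxBytes) {
        return (double) fdtBytes / docCount(fdxBytes);
    }

    private static void report(String label, long fdxBytes, long fdtBytes) {
        System.out.printf("%-6s docs=%d storedBytesPerDoc=%.1f%n",
            label, docCount(fdxBytes), storedBytesPerDoc(fdtBytes, fdxBytes));
    }

    public static void main(String[] args) {
        // Sizes copied from the _d.* (Lucene) and _4l.* (ES) listings above.
        report("Lucene", 36640292L, 165896465L);
        report("ES", 36527308L, 316074621L);
    }
}
```

On the sample above this gives about 4.58 million documents at roughly 36 stored bytes each for Lucene, against about 4.57 million documents at roughly 69 stored bytes each for ES. The document counts are nearly equal, so the extra .fdt size comes from more stored data per document rather than from more documents.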

Can anyone kindly run a simple test and let this thread know whether it's
just me, or whether this can be reproduced elsewhere?

jprante · June 6, 2013, 4:37pm

Note, ES sets omit_norms and omit_term_freq_and_positions to false by
default. Setting them to true saves some space.

Are you really comparing the correct Lucene versions? It could be you
are mixing 3.5, 3.6, 3.6.1

From the files in the Lucene version, it does not look like you are
storing many fields in there.

Jörg

shlomivaknin · June 9, 2013, 10:06am

hey Jörg,

I checked and made sure both are using 3.6.2 (ES branch 0.20:
https://github.com/elasticsearch/elasticsearch/blob/0.20/pom.xml#L33),
and the same version appears in my pom.xml.

About omit_norms etc., I explicitly made sure it would be the same in both
the Java code and the Elasticsearch mapping:
    {
        "test": {
            "_all": {
                "enabled": "false"
            },
            "properties": {
                "freq": {
                    "store": "yes",
                    "compress": "true",
                    "index_options": "docs",
                    "omit_norms": "true",
                    "type": "long",
                    "index": "not_analyzed"
                },
                "gram": {
                    "store": "yes",
                    "compress": "true",
                    "index_options": "docs",
                    "omit_norms": "true",
                    "type": "string"
                }
            },
            "_source": {
                "enabled": "false"
            }
        }
    }

And in the Java code I have:

    Document document = new Document();

    Field gram = new Field("ngram", ngram, Field.Store.YES, Field.Index.ANALYZED);
    gram.setOmitNorms(false);
    gram.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);

    NumericField frequencyField = new NumericField("frequency", Field.Store.YES, true);
    frequencyField.setOmitNorms(true);
    frequencyField.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
    frequencyField.setLongValue(frequency);

    document.add(gram);
    document.add(frequencyField);

From the files in the Lucene version, it does not look like you are
storing many fields in there.

I only index two fields, gram and freq, if that is what you meant.

These settings still give me 2x the size on Elasticsearch. Can anyone
confirm this on their own data?

I think I should write a little something that shows this and put it on
GitHub for you to check out, because I feel we are not getting anywhere.

Thanks for all your patience!

shlomivaknin · June 9, 2013, 4:47pm

hey,

I made a repo on GitHub, https://github.com/vadali/es-vs-lucene. It
contains two folders: one for ES, which is quite different from the code I
currently use (for example, it uses the plain REST API instead of
BulkProcessor), and another for Lucene, which is quite similar to the code
I currently use.

Both of them use lein (https://github.com/technomancy/leiningen#installation)
as their build tool, and there are full instructions on how to run each of
them in the readme (https://github.com/vadali/es-vs-lucene/blob/master/README.md).

The repo also contains a randomly generated data file, which exhibits the
same behavior I was reporting.
After optimizing with max_segments = 1, I get about 14 MB on Lucene and
20.5 MB on Elasticsearch. This is only a small dataset, so it doesn't seem
like much, but the difference becomes meaningful as the dataset gets larger
(and I have HUGE datasets).

Let me know if you have any problems running this test, and if you have any
ideas about why we see this size difference.

thanks!

jprante · June 10, 2013, 9:52pm

Your Java source code for Lucene shows IndexOptions.DOCS_ONLY. This is
equivalent to ES parameter omit_term_freq_and_positions = true. This is
missing in the ES mapping. I think there is some confusion about the ES
index_options "docs" setting for Lucene 4; I assume ES does not pick it up.
Maybe it helps to use the following for Lucene 3.6.1 / ES 0.20:

    {
        "test": {
            "_all": {
                "enabled": "false"
            },
            "_source": {
                "enabled": "false"
            },
            "properties": {
                "freq": {
                    "type": "long",
                    "store": "yes",
                    "omit_norms": "true",
                    "omit_term_freq_and_positions": "true"
                },
                "gram": {
                    "type": "string",
                    "store": "yes",
                    "omit_norms": "true",
                    "omit_term_freq_and_positions": "true"
                }
            }
        }
    }

Jörg

shlomivaknin · June 11, 2013, 12:09pm

Hey Jörg,

Thanks for checking this out!

You are right, I got confused by the docs about
omit_term_freq_and_positions; they said it would be deprecated.
I also found another problem: I shouldn't have omitted norms on the
gram field.

I fixed that, pushed, and ran the test, but the size didn't change.
