OK, my results are somewhat strange.
I tried everything from scratch and reindexed using both the pure Java + Lucene implementation and Elasticsearch.
Here is the Java code for adding both fields:
Document document = new Document();
// analyzed + stored string field
document.add(new Field("ngram", ngram, Field.Store.YES, Field.Index.ANALYZED));
// stored, not-indexed numeric (long) field
NumericField frequencyField = new NumericField("frequency", Field.Store.YES, false);
frequencyField.setLongValue(frequency);
document.add(frequencyField);
The Java version uses Lucene 3.5, and we are using a custom Similarity class.
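For completeness, the writer around that code is set up roughly like this (just a sketch; the index path is a placeholder, and "analyzer" / "customSimilarity" stand in for our real analyzer and the custom Similarity class, so don't take it as our exact code):

import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Lucene 3.5 writer setup around the document-building code above
Directory dir = FSDirectory.open(new File("/path/to/index"));      // placeholder path
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
config.setSimilarity(customSimilarity);                            // the custom Similarity mentioned above
IndexWriter writer = new IndexWriter(dir, config);
writer.addDocument(document);                                      // the Document built above
writer.close();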
Here is the Elasticsearch (0.20.5) mapping:
{
  "test": {
    "_all": {
      "enabled": "false"
    },
    "properties": {
      "freq": {
        "store": "yes",
        "compress": "true",
        "index": "not_analyzed",
        "type": "long"
      },
      "gram": {
        "store": "yes",
        "compress": "true",
        "type": "string",
        "analyzer": "ngram-index"
      }
    },
    "_source": {
      "enabled": "false"
    }
  }
}
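(A quick sanity check one can run to confirm the mapping really got applied as written; I'm assuming the standard get-mapping endpoint here:)

curl -XGET 'http://192.161.101.61:9200/test/_mapping?pretty'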
And the settings:
{
  "analysis": {
    "filter": {
      "myStop": {
        "stopwords": ["a", "b", "c"],   // just for the example, there's a large list here
        "type": "stop"
      }
    },
    "analyzer": {
      "ngram-index": {
        "tokenizer": "lowercase",
        "filter": ["myStop"],
        "type": "custom"
      }
    }
  },
  "similarity": {
    "search": {
      "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"   // the same similarity used in the java version
    },
    "index": {
      "type": "org.elasticsearch.index.similarity.CustomSimilarityProvider"
    }
  },
  "number_of_shards": 1,
  "number_of_replicas": 0
}
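(For reference, creating the index amounts to a single request with both blocks in the body; I actually do it from Clojure, but in curl terms it is roughly this, with the bodies from above elided:)

curl -XPUT 'http://192.161.101.61:9200/test' -d '{
  "settings": { ...the settings above... },
  "mappings": { "test": { ...the mapping above... } }
}'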
I ran both, and to my surprise, Elasticsearch's index was actually smaller than the pure Java version's:
ES: 15.8 GB
Java: 17 GB
But originally the size (what made me complain so loudly) was ~8 GB!
So I fired up Luke and went to the Java/Lucene index, where I ran "optimize". It worked for a long time, but the size remained the same.
I tried doing the same for ES using curl -XPOST 'http://192.161.101.61:9200/test/_optimize', but besides returning {"ok":true,"_shards":{"total":1,"successful":1,"failed":0}} right away, it didn't do much (should I have played with the parameters?).
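For example, I guess I should have forced it down to a single segment with something like this (assuming max_num_segments does what I think it does):

curl -XPOST 'http://192.161.101.61:9200/test/_optimize?max_num_segments=1'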
So I opened the ES index in Luke too and hit "optimize". I let it work for quite a while, and got back 32 GB (???!!!).
Wandering through Luke's options, I came across "cleanup index dir". I ran it over the Java Lucene index and got back a wonderful 8.4 GB, so I tried it on the ES index as well, with high hopes. It only got reduced back to 15.8 GB and stayed there.
So now, here are ls -ltra listings of both dirs:
ES:
ls -ltra
total 16585720
drwxr-xr-x 5 elasticsearch elasticsearch        4096 May 27 15:21 ..
-rw-r--r-- 1 root          root          16983734698 May 27 17:14 _38g.cfs
-rw-r--r-- 1 root          root                  285 May 27 17:26 segments_55
-rw-r--r-- 1 root          root                   20 May 27 17:26 segments.gen
drwxr-xr-x 2 elasticsearch elasticsearch       20480 May 27 17:26 .
Java:
ls -ltra
total 8759956
-rw-rw-r-- 1 shlomiv shlomiv 8970150460 May 27 17:04 _an.cfs
-rw-rw-r-- 1 shlomiv shlomiv 20 May 27 17:04 segments.gen
-rw-rw-r-- 1 shlomiv shlomiv 284 May 27 17:04 segments_38
drwxrwxrwt 34 root root 20480 May 27 17:17 ..
drwxrwxr-x 2 shlomiv shlomiv 4096 May 27 17:17 .
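(To cross-check what ES itself reports for the store size, rather than trusting ls, something like this should work, assuming the stats endpoint on 0.20 behaves the way I expect:)

curl -XGET 'http://192.161.101.61:9200/test/_stats?pretty'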
So it seems that our old dev guys used to manually clean their Java-generated Lucene index with Luke, which brought it down to around 8 GB. But unfortunately, this trick didn't work on ES's Lucene index.
What do you think? Does this make sense at all?
Thanks,
Shlomi
On Sunday, May 26, 2013 10:41:49 PM UTC+3, simonw wrote:
thanks! I'd be really happy to see the outcome!
On Sunday, May 26, 2013 12:06:38 PM UTC+2, Shlomi wrote:
Hey,
I am trying to pinpoint the difference between the two implementations, and am still working on replying to Matt and Simon. I am using Luke to see inside the indices.
As soon as I have more complete results I'll post them here.
On Friday, May 24, 2013 7:19:24 PM UTC+3, Otis Gospodnetic wrote:
Don't give up! This does matter and does affect performance (think disk
reads, think OS cache). There is _source, _all, compression, and other
factors that will affect index size, so it would be great to nail this down.
Otis
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
On Thursday, May 23, 2013 10:54:05 AM UTC-4, Jérôme Gagnon wrote:
+1 on that, we couldn't do much about it, we just hope that this
doesn't affect the disk IO performance...
On Thursday, May 23, 2013 10:34:38 AM UTC-4, Ivan Brusic wrote:
Just wanted to add that I always encountered the same issue with
Elasticsearch. Indices are almost twice as big despite aggressive trimming.
I have simply come to accept the issue as a fact and moved on.
--
Ivan
On Wed, May 22, 2013 at 12:35 PM, simonw <simon.w...@elasticsearch.com> wrote:
I suggest you provide your Lucene FieldTypes and your mapping, run your indexing against Lucene and a single-shard, no-replica Elasticsearch instance, then optimize the index and provide the output of ls -al on the index directory. It would also be interesting to know what exactly "much larger" is.
simon
On Wednesday, May 22, 2013 8:27:05 PM UTC+2, Matt Weber wrote:
Really, we are just shooting in the dark here because of a lack of information:
What version of ES? What version of Lucene? What do your Lucene index settings (tokenizer, analyzers, etc.) look like? Have you configured an ES mapping identical to what you use in Lucene? How are you measuring your index size? Have you tried indexing a single document in Lucene and ES and comparing the resulting index sizes?
Gist us your mapping (not the Clojure version), custom analyzer settings, index settings, etc., and we might be able to figure this out for you.
Thanks,
Matt Weber
On Wed, May 22, 2013 at 10:44 AM, Shlomi <shlomi...@gmail.com> wrote:
Hey,
Thanks for replying. "ngram" is the name of the field, and it is pre-computed.
Jörg - I think I might have misled you; I am not using the ngram tokenizer. ":ngram-index" is a custom analyzer that uses the "lowercase" tokenizer and a list of stopwords.
David - Thanks for the suggestion, but yeah, my code fails if the index exists before it runs, so this way I am sure the index was in fact deleted.
Mark - I tried with both a single shard and the default 5 shards. There was no difference in size (surprisingly...).
Thanks for all your responses, but we have to keep thinking...
On Wednesday, May 22, 2013 5:22:53 PM UTC+3, Jörg Prante wrote:
You are using the ngram tokenizer, which explodes index size. If you use ES default sharding, you have 5 shards (and therefore 5 Lucene indexes). With ngram, you have tokens scattered over all shards, and this converges to 5x the space compared to 1 shard.
Also, store = yes for each field is kind of clumsy. You have to enable each field to get them returned for a query (only _source is returned by default). I don't see much sense in making an ngram-analyzed field stored. Can you elaborate?
Jörg
On 22.05.13 11:08, Shlomi wrote:
Does ES store its numeric fields as strings?
Can someone confirm that if you disable _source and keep each field stored and indexed, your fields become invisible (although queryable)? Or am I doing something totally wrong?
Thanks
On Tuesday, May 21, 2013 7:10:07 PM UTC+3, Shlomi wrote:
Here is a fraction of the mapping I have (I use Clojure, so it's a bit different from JSON, but it's essentially the same):
{:test {:_source {:enabled "false"}
        :_all {:enabled "false"}
        :properties {:gram {:type "string" :store "yes" :analyzer :ngram-index :compress "true"}
                     :freq {:type "long" :store "yes"}}}}]
On Tuesday, May 21, 2013 7:07:44 PM UTC+3, Shlomi wrote:
Hey,
Thanks all, let me reply:
Michael - no, I set replicas to 0 (if that's what you meant).
Itamar & Matt - I disabled _all and _source and explicitly set "store" to "yes" for both fields (I don't care about perf for now). With this setting I still got a much larger size, and I was still unable to see the fields through queries (I only got ids back), even though I set store to yes.
On Tuesday, May 21, 2013 7:03:19 PM UTC+3, Matt Weber wrote:
Don't forget about the _all field. Also, if you don't store the source, you need to explicitly set "store" to yes on your field mappings so you can have them returned in the results.
On Tue, May 21, 2013 at 8:59 AM, Shlomi <shlomi...@gmail.com> wrote:
Yes, so I was trying to exclude the source, but then queries didn't return anything besides the id. In any case, even disabling the source still gave me a large index. Is there any way to tell it to save just the fields?
On Tuesday, May 21, 2013 6:54:38 PM UTC+3, Itamar Syn-Hershko wrote:
Yes, because ES stores the entire source by default.
On Tue, May 21, 2013 at 6:53 PM, Shlomi <shlomi...@gmail.com> wrote:
Hey,
We have some old Java code that uses Lucene and Grizzly to serve queries over text. We have two fields, a string field and a numeric (long) field. The indexing code is pretty straightforward.
I was trying to migrate this to Elasticsearch, with a pretty simple configuration, and indexed the same data. The Java-based implementation took about 6 GB, while the Elasticsearch one took 17 GB.
Does this make sense? What could I do about this?
Thanks!
On Monday, May 27, 2013 4:21:59 PM UTC+3, Jérôme Gagnon wrote:
That is exactly what I was talking about. Actually, I saw an improvement going from 0.20.x to 0.90.x, which is great! I'm also waiting for the outcome.