Elasticsearch index MUCH larger than similar Lucene index

shlomivaknin · May 21, 2013, 3:53pm

Hey,

We have some old Java code that uses Lucene and Grizzly to serve queries
over text. We have two fields, a string field and a numeric (long) field.
The indexing code is pretty straightforward.

I was trying to migrate this to Elasticsearch with a pretty simple
configuration, and indexed the same data.

The Java-based implementation took about 6 GB, while Elasticsearch took 17 GB.

Does this make sense? What could I do about this?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Itamar_Syn_Hershko · May 21, 2013, 3:54pm

Yes, because ES stores the entire source document (_source) by default.

shlomivaknin · May 21, 2013, 3:59pm

Yes, so I tried excluding the source, but then queries didn't return
anything besides the id. In any case, even disabling the source still gave
me a large index.

Is there any way to tell it to save just the fields?

Michael_Sick · May 21, 2013, 4:01pm

Do you have replication on?
Itamar_Syn_Hershko · May 21, 2013, 4:02pm

You can disable storing the source, but then you don't have stored fields
unless you explicitly ask for them in the mapping. It also costs more to
load several fields from the store when you don't have the source enabled.

mattweber · May 21, 2013, 4:03pm

Don't forget about the _all field. Also, if you don't store the source,
you need to explicitly set "store" to yes on your field mappings so you can
have them returned in the results.

shlomivaknin · May 21, 2013, 4:07pm

Hey,

Thanks all, let me reply:

Michael - no, I set replicas to 0 (if that's what you meant).

Itamar & Matt - I disabled _all and _source, and explicitly set "store" to
"yes" for both fields (I don't care about perf for now). With this setting
I still got a much larger index, and I was still unable to see the fields
through queries (I only got ids back), although I set store to yes.
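
For readers wondering why replication came up: each replica is a full extra copy of every primary shard, so it multiplies disk usage. Below is a minimal Python sketch of index settings with replicas turned off, plus the copy arithmetic. The helper name `total_shard_copies` is made up for illustration, and the 5-primary/1-replica figures are the Elasticsearch defaults of that era rather than anything stated in the post.

```python
import json


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Number of Lucene indexes kept on disk across the cluster.

    Hypothetical helper: every primary shard plus each of its replicas
    is a separate Lucene index.
    """
    return primaries * (1 + replicas)


# Settings body with replication disabled, as described in the post.
# "index.number_of_replicas" is a standard Elasticsearch index setting.
index_settings = {"settings": {"index": {"number_of_replicas": 0}}}

print(json.dumps(index_settings))
print(total_shard_copies(5, 1))  # 10 copies with era defaults
print(total_shard_copies(5, 0))  # 5 copies with replicas set to 0
```

With replicas already at 0, replication cannot explain the size difference, which is why the thread moves on to other causes.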

shlomivaknin · May 21, 2013, 4:10pm

Here is a fraction of the mapping I have (I use Clojure, so it's a bit
different from JSON, but it's essentially the same):

       {:test  {
                 :_source {:enabled "false" }
                 :_all    {:enabled "false" }
                 :properties {:gram  {:type "string" :store "yes"
                                      :analyzer :ngram-index :compress "true"}
                              :freq  {:type "long" :store "yes"}
                              }}}]
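
For readers who don't use Clojure, here is a rough Python/JSON rendering of the mapping fragment above. It is only a sketch: keyword keys are turned into strings and the values are copied as posted (including the string "false" and the `compress` option), without checking that every option is still valid for a given Elasticsearch version. The trailing `]` in the post belongs to an enclosing structure that wasn't shown, so it is left out.

```python
import json

# JSON-style rendering of the posted Clojure mapping for type "test".
# Values are copied as posted; nothing here is validated against a server.
mapping = {
    "test": {
        "_source": {"enabled": "false"},
        "_all": {"enabled": "false"},
        "properties": {
            "gram": {
                "type": "string",
                "store": "yes",
                "analyzer": "ngram-index",  # analyzer name taken from the post
                "compress": "true",
            },
            "freq": {"type": "long", "store": "yes"},
        },
    }
}

print(json.dumps(mapping, indent=2))
```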

shlomivaknin · May 22, 2013, 9:08am

Does ES store its numeric fields as strings?

Can someone confirm that if you disable _source and keep each field
stored and indexed, your fields become invisible (although queryable)? Or
am I doing something totally wrong?

Thanks

Michael_Sick · May 22, 2013, 1:13pm

Not sure about the storage - you might try the tool hosted on Google Code
(Google Code Archive) and jprante/elasticsearch-skywalker on GitHub
(Skywalker for Elasticsearch is like Luke for Lucene) to see into your
indices. I have not used either but had bookmarked them for just such an
occasion.

shlomivaknin · May 22, 2013, 1:49pm

Hey Michael

I managed to find my fields (I had to ask for them explicitly). The question
remains, though: why is the index so big? I'll give Skywalker a chance and
see if it sheds some light on the situation.

The weird thing is that even though I disabled _source and _all, the size
remained the same, meaning 17 GB instead of 7 GB. That's a lot of wasted
space.

If anyone has any more ideas why Elasticsearch is so much larger than a
straightforward Lucene index, I am very interested to hear them.
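
As an illustration of what asking for the fields explicitly can look like once _source is disabled, here is a small Python sketch of a search request body. The post does not show the actual request, so this is an assumption: it uses the `fields` parameter of the Elasticsearch search API from that era (later versions call it `stored_fields`), and the helper name `stored_fields_search` is invented for the example.

```python
import json


def stored_fields_search(field_names):
    """Build a match_all search body that requests specific stored fields.

    Hypothetical helper; "fields" is the 0.90-era request parameter,
    renamed "stored_fields" in later Elasticsearch versions.
    """
    return {"query": {"match_all": {}}, "fields": list(field_names)}


# Field names taken from the mapping posted earlier in the thread.
search_body = stored_fields_search(["gram", "freq"])
print(json.dumps(search_body))
```

Without such a list, only the id comes back for each hit when _source is disabled, which matches what was reported earlier in the thread.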

Mark_Harwood1 · May 22, 2013, 2:10pm

How many shards do you have?

A multi-sharded ES index will cost you more than a single Lucene index due
to duplication of index terms and worse postings compression when the same
content is split across many Lucene indexes.
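
The term-duplication effect can be seen in a toy simulation. The Python sketch below is not how Lucene stores anything; it only counts distinct ngram terms in one combined "index" versus the sum over several "shards" holding the same documents. The random documents, the 2-3 character gram sizes and the round-robin shard assignment are all assumptions made for illustration.

```python
import random


def ngrams(text, min_n=2, max_n=3):
    """Set of character ngrams of length min_n..max_n in text."""
    return {
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    }


def unique_terms(docs):
    """Distinct terms a single index would keep in its term dictionary."""
    terms = set()
    for doc in docs:
        terms |= ngrams(doc)
    return terms


rng = random.Random(42)
# Synthetic documents over a small alphabet so ngrams repeat often.
docs = ["".join(rng.choice("abcdefgh") for _ in range(12)) for _ in range(2000)]

num_shards = 5
shards = [docs[i::num_shards] for i in range(num_shards)]  # round-robin split

single_index_terms = len(unique_terms(docs))
sharded_terms = sum(len(unique_terms(shard)) for shard in shards)

print("terms in one index:      ", single_index_terms)
print("terms summed over shards:", sharded_terms)
```

When most terms occur in every shard, the summed count approaches the shard count times the single-index count. This covers only the term dictionary; postings compression is a separate effect that the sketch does not model.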

David_G_Ortega · May 22, 2013, 2:10pm

I had the same size issue, but it was exactly what the colleagues have
pointed out: _all and _source enabled. Makes sense. The only thing I can
think of, and it's so silly that I'm almost ashamed to ask, is: have you
deleted the index before applying the mapping?

I also see that you are using ngrams. I suppose you use those in the
vanilla Lucene index as well...

jprante · May 22, 2013, 2:10pm

Please note, Skywalker needs an update for 0.90 - it is still on 0.20.
The update is in progress.

Jörg

On 22.05.13 15:49, Shlomi wrote:

I managed to find my fields (I had to ask for them explicitly). The
question remains, though: why is the database so big? I'll give
Skywalker a chance and see whether it sheds some light on this situation.

jprante · May 22, 2013, 2:22pm

You are using an ngram tokenizer, which explodes index size. If you use the
ES default sharding, you have 5 shards (and therefore 5 Lucene indexes).
With ngrams, the tokens are scattered over all shards, and this
converges to 5x the space compared to 1 shard.

Also, store = yes for each field is kind of clumsy. You have to request
each field explicitly to get it returned for a query (only _source is
returned by default). I don't see much sense in making an ngram-analyzed
field stored. Can you elaborate?

Jörg
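
For anyone following along, here is a minimal sketch of what "requesting each field" could look like once _source is disabled. It assumes the search request body of the ES version discussed here accepts a `fields` list (later releases use `stored_fields` instead, so check the docs for your version). The field names `gram` and `freq` come from the mapping quoted in this thread; the helper name and the query text are made up for illustration.

```python
import json


def stored_fields_search(query_text, fields=("gram", "freq"), size=10):
    """Build a search body that asks for stored fields explicitly.

    Hypothetical helper: with _source disabled, a hit only carries its id
    unless the request names the stored fields it wants back.
    """
    return {
        "query": {"match": {"gram": query_text}},
        # Assumption: "fields" is the request key in this ES generation;
        # newer versions call it "stored_fields".
        "fields": list(fields),
        "size": size,
    }


body = stored_fields_search("new york")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a `_search` request. Without the `fields` entry, only ids would come back, which matches what Shlomi describes.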

On 22.05.13 11:08, Shlomi wrote:

Does ES store its numeric fields as strings?

Can someone confirm that if you disable _source and keep each field
stored and indexed, your fields become invisible (although
queryable)? Or am I doing something totally wrong?

Thanks

On Tuesday, May 21, 2013 7:10:07 PM UTC+3, Shlomi wrote:

here is a fraction of the mapping i have (i use clojure so its a
bit different from json, but its essentially the same):

           {:test  {
                     :_source {:enabled "false" }
                     :_all    {:enabled "false" }
                     :properties {:gram {:type "string" :store "yes" :analyzer :ngram-index :compress "true"}
                                  :freq {:type "long" :store "yes"} }}}]

shlomivaknin · May 22, 2013, 5:44pm

Hey,

Thanks for replying. ngram is the name of the field, and its content is
pre-computed:

Jörg - I think I might have misled you. I am not using the ngram tokenizer;
":ngram-index" is a custom analyzer that uses the "lowercase" tokenizer and a
list of stopwords.

David - Thanks for the suggestion, but yes: my code fails if the index
exists before it runs, so I am sure the index was in fact deleted.

Mark - I tried with both a single shard and the default 5 shards. There was
no difference in size (surprisingly).

Thanks for all your responses, but we have to keep thinking.

mattweber · May 22, 2013, 6:27pm

Really, we are just shooting in the dark here because of a lack of information:

What version of ES? What version of Lucene? What do your Lucene index
settings (tokenizer, analyzers, etc.) look like? Have you configured an ES
mapping identical to what you use in Lucene? How are you measuring your
index size? Have you tried indexing a single document in Lucene and in ES
and comparing the resulting index sizes?

Gist us your mapping (not the Clojure version), custom analyzer settings,
index settings, etc., and we might be able to figure this out for you.

Thanks,
Matt Weber
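
As a rough illustration of the JSON form being asked for, the Clojure fragment quoted earlier in the thread could be written out as below. This is only a sketch: the type name `test`, the field names, and the analyzer name `ngram-index` are taken from that fragment, the definition of the custom analyzer was never posted and is therefore left out, and the shard and replica settings are an assumption matching the single-shard, zero-replica setup Shlomi says he tried.

```python
import json

# Mapping transcribed from the Clojure fragment in this thread.
# The thread used the string "false"; JSON booleans are used here instead.
mapping = {
    "test": {
        "_source": {"enabled": False},
        "_all": {"enabled": False},
        "properties": {
            "gram": {
                "type": "string",
                "store": "yes",
                "analyzer": "ngram-index",
                # Copied verbatim from the thread; whether this option has
                # any effect on a string field is not confirmed here.
                "compress": "true",
            },
            "freq": {"type": "long", "store": "yes"},
        },
    }
}

# Assumed index settings: one shard, no replicas, for a like-for-like
# comparison against a single Lucene index.
create_index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": mapping,
}

print(json.dumps(create_index_body, indent=2))
```

The printed document is the kind of thing that could be pasted into a gist, together with the analyzer settings that are missing here.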

simonw_2 · May 22, 2013, 7:35pm

I suggest you provide your Lucene FieldTypes and your mapping, and run your
indexing against Lucene and against a single-shard, no-replica Elasticsearch
instance. Then optimize the index and provide the output of ls -al on the
index directory. It would also be interesting to know what exactly "much
larger" means.

simon
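
One possible way to make the `ls -al` comparison easier to read is to total the file sizes per extension in each index directory, so the two indexes can be compared file type by file type (for example, `.fdt` files hold Lucene stored-field data). The sketch below is an assumption about how one might do that, not something posted in the thread; the function names are made up, and the directories to scan are passed on the command line.

```python
import os
import sys
from collections import defaultdict


def size_by_extension(index_dir):
    """Return {extension: total bytes} for every file under index_dir."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(index_dir):
    """Print the total size and a per-extension breakdown, largest first."""
    totals = size_by_extension(index_dir)
    print(index_dir, "total bytes:", sum(totals.values()))
    for ext, size in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"  {ext:8} {size}")


if __name__ == "__main__":
    # Usage: python index_sizes.py <lucene index dir> <ES shard index dir>
    for path in sys.argv[1:]:
        report(path)
```

Running it on both directories after an optimize should show which file types account for the extra space, which narrows down whether stored fields, the term dictionary, or something else is responsible.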

Ivan · May 23, 2013, 2:34pm

Just wanted to add that I have always encountered the same issue with
Elasticsearch. Indices are almost twice as big despite aggressive trimming.
I have simply come to accept the issue as a fact and moved on.

--
Ivan

Jerome_Gagnon · May 23, 2013, 2:54pm

+1 on that, we couldn't do much about it, we just hope that this doesn't
affect the disk IO performance...

On Thursday, May 23, 2013 10:34:38 AM UTC-4, Ivan Brusic wrote:

Just wanted to add that I always encountered the same issue with
Elasticsearch. Indices are almost twice as big despite aggressive trimming.
I have simply come to accept the issue as a fact and moved on.

--
Ivan

On Wed, May 22, 2013 at 12:35 PM, simonw <simon.w...@elasticsearch.com> wrote:
I suggest you provide your Lucene FieldTypes and your mapping, then run your
indexing against Lucene and against a single-shard, no-replica Elasticsearch
instance. Then optimize both indexes and provide the output of ls -al on each
index directory. It would also be interesting to know what exactly "much
larger" means.

simon
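
A rough way to do the comparison simon describes, beyond eyeballing ls -al, is to total the file sizes per extension in each index directory. This is only a sketch: the function names are made up for illustration, and the directory you pass in is whatever path your Lucene index or Elasticsearch shard actually lives at. In Lucene the .fdt files hold stored field data, so a large gap there would point at stored fields or _source rather than at the postings.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir):
    """Sum file sizes (in bytes) under index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files without an extension (e.g. segments_N) are grouped together.
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def report(index_dir):
    """Print the per-extension totals, largest first, plus a grand total."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{ext:>8}  {size / 1024 ** 2:10.1f} MiB")
    print(f"{'total':>8}  {sum(totals.values()) / 1024 ** 2:10.1f} MiB")
```

Calling report() once on the Lucene index directory and once on the Elasticsearch shard directory gives two tables that can be compared line by line.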

On Wednesday, May 22, 2013 8:27:05 PM UTC+2, Matt Weber wrote:
Really we are just shooting in the dark here because of a lack of
information:

What version of ES? What version of Lucene? What do your Lucene
index settings (tokenizer, analyzers, etc.) look like? Have you configured
an ES mapping identical to what you use in Lucene? How are you measuring
your index size? Have you tried indexing a single document in Lucene and
ES and comparing the resulting index size?

Gist us your mapping (not the Clojure version), custom analyzer
settings, index settings, etc. and we might be able to figure this out for
you.

Thanks,
Matt Weber

On Wed, May 22, 2013 at 10:44 AM, Shlomi shlomi...@gmail.com wrote:
Hey,

Thanks for replying, ngram is the name of the field, and is
pre-computed:

Jörg - I think i might have misled you, i am not using the ngram
tokenizer, ":ngram-index" is a custom tokenizer that uses "lowercase"
tokenizer, and a list of stopwords.

David - Thanks for the suggestion, but yeah, my code fails if the index
exists before it runs, this way i am sure the index was in fact deleted..

Mark - I tried with both a single shard and the default 5 shards. there
was no difference in size (surprisingly.. )

thanks for all your responses, but we have to keep thinking..

On Wednesday, May 22, 2013 5:22:53 PM UTC+3, Jörg Prante wrote:
You are using ngram tokenizer which explodes index size. If you use ES
default sharding, you have 5 shards (and therefore, 5 Lucene indexes).
With ngram, you have scattered tokens over all shards, and this
converges to 5x the space compared to 1 shard.

Also, store = yes for each field is kind of clumsy. You have to enable
each field to get them returned for a query (only _source is returned by
default). I don't see much sense in making an ngram analyzed field
stored. Can you elaborate?

Jörg
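
To illustrate why n-gram tokenization inflates an index, here is a small sketch that counts the character n-grams produced from a single word. The function name and the 1 to 3 gram lengths are assumptions picked for the example, not settings taken from this thread, and the code says nothing about how the tokens are spread across shards.

```python
def char_ngrams(token, min_n, max_n):
    """Return every character n-gram of token with min_n <= n <= max_n."""
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


word = "elasticsearch"
grams = char_ngrams(word, 1, 3)
# One 13-character input token becomes 13 + 12 + 11 = 36 indexed tokens.
print(len(word), len(grams))
```

Every extra gram length adds roughly another token per character, which is where the index growth comes from.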

On 22.05.13 11:08, Shlomi wrote:
does ES store its numeric fields as strings?

can someone confirm that if you disable _source and keep each field as
stored and indexed, your fields become invisible (although
queryable)? or am i doing something totally wrong?..

Thanks
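
For what it's worth, stored fields are not returned unless the search request asks for them, which would explain getting only ids back once _source is disabled. Below is a minimal sketch of such a request body. It assumes the search API of that era, where the body's "fields" key lists the stored fields to return (to my understanding later releases renamed it to "stored_fields"); the helper name is made up, and gram and freq are the field names from the mapping in this thread.

```python
import json


def stored_fields_query(field_names, query=None):
    """Build a search body that asks for specific stored fields.

    Assumption: the "fields" key requests stored fields, as in the
    0.90-era search API. Defaults to a match_all query.
    """
    return {
        "query": query if query is not None else {"match_all": {}},
        "fields": list(field_names),
    }


body = stored_fields_query(["gram", "freq"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index to check whether the stored values come back.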

On Tuesday, May 21, 2013 7:10:07 PM UTC+3, Shlomi wrote:
here is a fraction of the mapping i have (i use clojure so it's a
bit different from json, but it's essentially the same):

           {:test  {
                     :_source {:enabled "false" }
                     :_all    {:enabled "false" }
                     :properties {:gram  {:type "string" :store "yes" :analyzer :ngram-index :compress "true"}
                                  :freq  {:type "long" :store "yes"} }}}]
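
For readers who want the JSON form that was asked for elsewhere in the thread, here is a direct translation of the Clojure fragment above, written as a Python dict and dumped to JSON. It is a sketch that assumes the keywords map one-to-one onto JSON keys and that :ngram-index becomes the analyzer name "ngram-index"; the analyzer definition itself is not shown in the thread, so it is not reproduced here.

```python
import json

# Direct translation of the Clojure mapping fragment; values are kept
# exactly as posted (string "false"/"yes"/"true", not booleans).
mapping = {
    "test": {
        "_source": {"enabled": "false"},
        "_all": {"enabled": "false"},
        "properties": {
            "gram": {
                "type": "string",
                "store": "yes",
                "analyzer": "ngram-index",  # assumed name of the custom analyzer
                "compress": "true",
            },
            "freq": {"type": "long", "store": "yes"},
        },
    }
}

print(json.dumps(mapping, indent=2))
```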

On Tuesday, May 21, 2013 7:07:44 PM UTC+3, Shlomi wrote: 

    Hey, 

    thanks all, let me reply: 

    Michael - no, i set replicas to 0 (if that's what you meant..)

    Itamar & Matt - i disabled _all and _source, and explicitly
    set "store" to "yes" for both fields (i dont care about perf
    for now..) - with this setting i still got a much larger size
    and was still unable to see the fields (although i set store
    to yes) through queries (only got id's back)

    On Tuesday, May 21, 2013 7:03:19 PM UTC+3, Matt Weber wrote: 

        Don't forget about the _all field.  Also, if you don't 
        store the source, you need to explicitly set "store" to 
        yes on your field mappings so you can have them returned 
        in the results. 


        On Tue, May 21, 2013 at 8:59 AM, Shlomi 
        <shlomi...@gmail.com> wrote: 

            yes, so i was trying to exclude source, but then
            queries didnt return anything besides id. but in any
            case, even disabling source still gave me a large index..
            any way to tell it to save just the fields?


            On Tuesday, May 21, 2013 6:54:38 PM UTC+3, Itamar 
            Syn-Hershko wrote: 

                Yes, because ES stores the entire source by default
