The effect of multi-fields and copy_to on storage size

I want to analyze the same content three different ways and be able to
return highlighting information in the query results. With _source
disabled, the fields need to be stored for highlighting to work. The mapping
below contains a multi-field that is indexed two different ways. The same
content is also copied to a third, standalone field and indexed a different
way (just pretend that it can't be moved into the multi-field). My question:
will this content be stored once, twice, or three times?

{
  "text_document": {
    "_source": {
      "enabled": false
    },
    "_all": {
      "enabled": false
    },
    "properties": {
      "body": {
        "type": "string",
        "store": true,
        "analyzer": "standard",
        "copy_to": "another_field",
        "fields": {
          "secondary": {
            "type": "string",
            "store": true,
            "analyzer": "simple"
          }
        }
      },
      "another_field": {
        "type": "string",
        "store": true,
        "analyzer": "snowball"
      }
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b88388f2-8632-4a16-879c-39150019edfb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ideas anyone?

In that example, all three fields are going to be indexed and stored
separately, so the content will be stored three times. However, I would
expect compression to handle that efficiently.

On Wed, Apr 30, 2014 at 10:07 PM, Jeremy McLain <gongchengshi@gmail.com> wrote:

> Ideas anyone?

--
Adrien Grand

Even if the three fields are compressed, isn't it still storing three
compressed copies of the same thing? That is still three times more
overhead than it needs. It seems very wasteful of space.
Ideally the space used by the database would be
size_of_stored_fields_compressed + size_of_index. In my case my database
will look more like (size_of_stored_fields_compressed x 3) + size_of_index.
This greatly increases my storage requirements!

If I enabled the type's _source field and disabled individual field storage
could I still get highlighting info in the query response for those fields?

Thanks, Adrien, for your response.

Hi Jeremy,

On Mon, May 12, 2014 at 7:43 PM, Jeremy McLain <gongchengshi@gmail.com> wrote:

> Even if the three fields are compressed, isn't it still storing three
> compressed copies of the same thing? That is still three times more
> overhead than it needs. It seems very wasteful of space.
> Ideally the space used by the database would be
> size_of_stored_fields_compressed + size_of_index. In my case my database
> will look more like (size_of_stored_fields_compressed x 3) + size_of_index.
> This greatly increases my storage requirements!

It is not storing 3 compressed copies of the same thing; it is storing these
3 things compressed together, as a whole. The difference is important because
it means that the 2nd and 3rd copies are effectively stored as references to
the first field value. I would recommend building two indices, one with
the copied fields and one without, to see what the difference is in practice.
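
To make the "compressed as a whole" point concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions: it
uses zlib (DEFLATE) as a stand-in for whatever codec the index actually
uses, and the text produced by make_text is hypothetical sample data. It
compares three copies compressed independently with three copies compressed
together in one block.

```python
import random
import zlib


def make_text(n_words: int = 1500, seed: int = 0) -> bytes:
    """Build a reproducible pseudo-document of roughly 10 KB (hypothetical data)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
        for _ in range(500)
    ]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("utf-8")


# Kept small on purpose so that the repeated copies fall inside
# DEFLATE's 32 KB look-back window.
body = make_text()

# One stored copy of the content, compressed on its own.
one_copy = len(zlib.compress(body))

# Three copies, each compressed independently: the "x 3" worst case.
three_separate = 3 * one_copy

# Three copies compressed together as a single block, standing in for the
# three stored fields of one document sitting next to each other.
three_together = len(zlib.compress(body * 3))

print("raw size of one copy:      ", len(body))
print("one copy, compressed:      ", one_copy)
print("three copies, separately:  ", three_separate)
print("three copies, one block:   ", three_together)
```

In this sketch the second and third copies cost little beyond the first,
because the compressor encodes them as back-references. The saving depends
on the copies landing within the same compression block or window, so very
large field values may not benefit in the same way; measuring two real
indices remains the reliable test.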

> If I enabled the type's _source field and disabled individual field
> storage could I still get highlighting info in the query response for those
> fields?

Yes. I would recommend keeping _source if possible in any case: it makes
lots of things easier; for example, you can reindex from Elasticsearch
itself.
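
As a rough sketch of what that could look like, the Python snippet below
builds a variant of the mapping with _source left enabled and "store"
omitted on body and body.secondary, plus a search body that asks for
highlights on both. The query text "storage" is a placeholder, and
another_field keeps "store": true here as a cautious assumption, since
copy_to does not add the copied value to _source; that part is worth
verifying on your version.

```python
import json

# Hypothetical variant of the mapping from the question: _source is left
# enabled (the default) and "store" is omitted on body and body.secondary.
mapping = {
    "text_document": {
        "_all": {"enabled": False},
        "properties": {
            "body": {
                "type": "string",
                "analyzer": "standard",
                "copy_to": "another_field",
                "fields": {
                    "secondary": {"type": "string", "analyzer": "simple"}
                },
            },
            # Assumption: the copy_to target is not present in _source,
            # so it stays stored in case it needs to be highlighted.
            "another_field": {
                "type": "string",
                "store": True,
                "analyzer": "snowball",
            },
        },
    }
}

# Search body that queries both analyzed variants and requests highlights
# for each of them. "storage" is a placeholder query string.
search_body = {
    "query": {
        "multi_match": {
            "query": "storage",
            "fields": ["body", "body.secondary"],
        }
    },
    "highlight": {"fields": {"body": {}, "body.secondary": {}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

The two printed JSON documents can be sent as the mapping and the search
request respectively; comparing the resulting index size against the
all-stored mapping shows what the trade-off is in practice.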

--
Adrien Grand
