I want to analyze the same content three different ways and still be able
to return highlighting information in the query results. To do
highlighting, the fields need to be stored. The mapping below contains
a multi-field indexed two different ways. The same content is copied to a
third, standalone field and indexed a different way (just pretend that it
can't be moved into the multi-field). My question: will this content be
stored once, twice, or three times?

In that example, all three fields are going to be indexed and stored
separately, so the content will be stored three times. However, I would
expect compression to handle that efficiently.

Even if the three fields are compressed, isn't it still storing three
compressed copies of the same thing? That is three times the overhead
it needs. It seems very wasteful of space.
Ideally the space used by the database would be
size_of_stored_fields_compressed + size_of_index. In my case it
will look more like (size_of_stored_fields_compressed x 3) + size_of_index.
This greatly increases my storage requirements!

If I enabled the type's _source field and disabled individual field storage,
could I still get highlighting info in the query response for those fields?

It is not storing three compressed copies of the same thing; it is storing
these three things together, compressed as a whole. The difference is
important because it means that the second and third copies are effectively
stored as references to the first field value. I would recommend building
two indices, one with the copied fields and one without, to see what the
difference is in practice.
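
To make the "compressed as a whole" point concrete, here is a rough sketch in Python. It uses zlib purely as a stand-in compressor (it is not the codec Elasticsearch uses for stored fields), and the content is made-up random words, so the numbers only illustrate the principle: compressing three copies together costs far less than compressing each copy separately.

```python
import random
import zlib

# Hypothetical content: a deterministic, mildly compressible block of "words".
rng = random.Random(42)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = [
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 10)))
    for _ in range(200)
]
content = " ".join(rng.choice(vocab) for _ in range(600)).encode("utf-8")

# Size of one compressed copy.
single = len(zlib.compress(content))

# Case A: three copies, each compressed on its own.
separate = 3 * single

# Case B: three copies concatenated and compressed as one block.
# The compressor can encode copies two and three as back-references to copy one.
together = len(zlib.compress(content * 3))

print(f"raw size of one copy:        {len(content)} bytes")
print(f"one copy, compressed:        {single} bytes")
print(f"three copies, separately:    {separate} bytes")
print(f"three copies, as one block:  {together} bytes")
```

How close a real index gets to this depends on how the stored-field blocks are laid out and how large the field values are, which is why measuring with two actual indices is still the reliable test.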

> If I enabled the type's _source field and disabled individual field
> storage could I still get highlighting info in the query response for those
> fields?
Yes, although I would recommend keeping _source if possible. It makes lots
of things easier; for example, you can reindex from Elasticsearch itself.
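
For reference, a highlighting request has the shape sketched below, written as a Python dict that serializes to the JSON search body. The field name `content` and the query text are hypothetical, since the original mapping is not reproduced in this thread.

```python
import json

# Hypothetical field name and query text, for illustration only.
search_body = {
    "query": {
        "match": {"content": "storage overhead"}
    },
    "highlight": {
        # An empty object requests the default highlighter settings for the field.
        "fields": {
            "content": {}
        }
    },
}

# This JSON would be sent as the body of a search request.
print(json.dumps(search_body, indent=2))
```

One thing worth checking in a test index: as far as I know, a field that is populated only by copying from another field is not itself part of _source, so highlighting on that field may still require it to be stored.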