ES allows for various options for which term_vector information is stored (yes, with_positions_offsets, etc.). The default option, i.e. not setting term_vector in the mapping at all, appears to store the same information as the with_positions_offsets option but produces a smaller index. Does anyone know why?
Here are some examples (in Sense):
default:
PUT test_default_text
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string"
}
}
}
}
}
PUT test_default_text/doc/1
{
"text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}
with_positions_offsets:
PUT test_full_text
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions_offsets"
}
}
}
}
}
PUT test_full_text/doc/1
{
"text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}
store size:
GET test_default_text/_stats/store
...
"store": {
"size_in_bytes": 5661,
"throttle_time_in_millis": 0
}
...
GET test_full_text/_stats/store
...
"store": {
"size_in_bytes": 6373,
"throttle_time_in_millis": 0
}
...
The index with the default mapping is smaller but seems to contain the same information, i.e. submitting
GET test_default_text/doc/1/_termvectors?fields=text
returns term vector data with positions and offsets. Even setting "term_vector": "yes" creates a bigger index (6217 bytes here) than the default, yet it returns only a subset of the term vector data that the default returns. In other words, the index that exposes less term vector information is the larger one on disk.
This seems to be stable, and the gap is even more pronounced on bigger indexes.
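To put numbers on the gap, here is a minimal Python sketch that works out the extra bytes and the relative overhead of each explicit term_vector setting against the default. The sizes are copied from the _stats/store responses above; the function name overhead_vs_default is just something I made up for illustration, and this only does arithmetic on the reported sizes, it does not talk to the cluster.

```python
# Store sizes in bytes, copied from the _stats/store responses in this post.
sizes = {
    "default": 5661,
    "yes": 6217,
    "with_positions_offsets": 6373,
}


def overhead_vs_default(sizes, baseline="default"):
    """Return {setting: (extra_bytes, extra_percent)} relative to the baseline."""
    base = sizes[baseline]
    return {
        name: (size - base, 100.0 * (size - base) / base)
        for name, size in sizes.items()
        if name != baseline
    }


for name, (extra, pct) in overhead_vs_default(sizes).items():
    print(f"{name}: +{extra} bytes ({pct:.1f}% larger than default)")
```

On this single-document index the differences are only a few hundred bytes, so fixed per-segment overhead may account for some of it; that is a guess on my part, which is why the behaviour on bigger indexes matters more.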
Does anyone understand what the issue is?
Thanks!
PS I've posted the same question on SO, but this seems like a more appropriate place.