Elasticsearch term vector mapping options and index size

dkrasner · August 29, 2016, 12:36pm

ES offers several options for which term vector information is stored (yes, with_positions, with_positions_offsets, etc.). With the default, i.e. no explicit term_vector setting in the mapping, the index appears to hold the same information as with the with_positions_offsets option, yet it is smaller on disk. Does anyone know why?

Here are some examples (in Sense):

default:

PUT test_default_text
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string"
        }
      }
    }
  }
}


PUT test_default_text/doc/1
{
  "text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}

with_positions_offsets:

PUT test_full_text
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":        "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}


PUT test_full_text/doc/1
{
  "text": "The good news is that we brought even more improvements to the document store in Lucene 5.0. More and more users are indexing huge amounts of data and in such cases the bottleneck is often I/O, which can be improved by heavier compression. Lucene 5.0 still has the same default codec as Lucene 4.1 but now allows you to use DEFLATE (the compression algorithm behind zip, gzip and png) instead of LZ4, if you would like to have better compression. We know this is something which has been long awaited, especially by our logging users."
}

store size:

GET test_default_text/_stats/store

...
  "store": {
        "size_in_bytes": 5661,
        "throttle_time_in_millis": 0
      }
...


GET test_full_text/_stats/store

...
  "store": {
        "size_in_bytes": 6373,
        "throttle_time_in_millis": 0
      }
...

The index with the default mapping is smaller but seems to contain the same information, i.e. submitting

GET test_default_text/doc/1/_termvectors?fields=text

returns term vector data with positions and offsets. Even setting "term_vector": "yes" creates a bigger index (6217 bytes here), although it returns only a subset of the term vector data that the default returns. In other words, the index that should hold less information is larger on disk.
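
For anyone reproducing the comparison, the mappings differ only in the term_vector setting. Here is a minimal Python sketch that builds the request bodies. The helper `build_mapping` and the index name `test_yes_text` are made up for illustration and were not part of my original test.

```python
import json


def build_mapping(term_vector=None):
    """Return the mapping body for a 'doc' type with one string field, 'text'.

    term_vector=None reproduces the default case (no explicit setting).
    """
    field = {"type": "string"}
    if term_vector is not None:
        field["term_vector"] = term_vector
    return {"mappings": {"doc": {"properties": {"text": field}}}}


# Index names: the first and last come from the post; test_yes_text is hypothetical.
variants = {
    "test_default_text": None,
    "test_yes_text": "yes",
    "test_full_text": "with_positions_offsets",
}

for index, option in variants.items():
    # Print each request in the form it would be pasted into Sense.
    print("PUT " + index)
    print(json.dumps(build_mapping(option), indent=2))
```

Indexing the same document into each index and then calling `_stats/store` on it should give a like-for-like size comparison.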

The effect seems to be reproducible and is even more pronounced on bigger indexes.
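
To put numbers on the gap for the single-document test, here is a small Python sketch that computes the overhead of each explicit setting relative to the default. It uses only the size_in_bytes values reported above; the function name `overhead_percent` is my own.

```python
def overhead_percent(size, baseline):
    """Percentage by which `size` exceeds `baseline`."""
    return (size - baseline) / baseline * 100.0


default_size = 5661  # test_default_text, no explicit term_vector

# Store sizes in bytes for the indexes with an explicit term_vector setting.
sizes = {
    "yes": 6217,
    "with_positions_offsets": 6373,
}

for option, size in sizes.items():
    extra = size - default_size
    pct = overhead_percent(size, default_size)
    print(f"{option}: +{extra} bytes ({pct:.1f}% larger than default)")
```

For these figures that works out to roughly 9.8% for "yes" and 12.6% for "with_positions_offsets".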

Does anyone understand what the issue is?

thanks!

PS: I've posted the same question on Stack Overflow, but this seems like a more appropriate place.
