Representation of non-existent field

3ygun · July 11, 2016, 9:15pm

If I have a mapping of:

GET /my_index/_mapping
{
  "mappings": {
    "properties": {
      "name": { "type": "string" },
      "favNum": { "type": "integer" }
    }
  }
}

With inputs of:

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "name": "foo", "favNum": 13 }
{ "index": { "_id": "2"              }}
{ "name":  "bar" }

Would "bar"'s favNum property take up any space in Elasticsearch? Or would its nonexistence be determined by comparing the document against the mapping?

In SQL it would look something like this:

name | favNum
-----|-------
foo  | 13
bar  | null

The question comes from the fact that I want to define fields 1 to n, but a given document may only have n-1 of them filled with data. I'm aware of the "Dealing with Null Values" chapter, but I don't believe it answers my question.

polyfractal · July 11, 2016, 10:17pm

So there are two data structures that are important here: the inverted index and doc values.

The inverted index is responsible for search. It contains a mapping of term -> documents for each field. In this data structure, null (i.e. missing) values are not stored. There is no way to represent a missing value inside an inverted index unless you index a placeholder value (which is what the "Dealing with Null Values" chapter talks about).

So for search, the overhead is relatively small.
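
To make the term -> documents idea concrete, here is a toy sketch in Python. It is only an illustration of the concept, not how Lucene actually stores an inverted index; the `build_inverted_index` helper and the in-memory `docs` dict are made up for the example and reuse the two documents from the question.

```python
from collections import defaultdict

# The two documents from the question, keyed by _id.
docs = {
    1: {"name": "foo", "favNum": 13},
    2: {"name": "bar"},  # no favNum at all
}

def build_inverted_index(docs):
    """Toy model: field -> term -> list of doc ids containing that term."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            # Only fields that are present in the document produce an entry.
            index[field][value].append(doc_id)
    return index

inverted = build_inverted_index(docs)
print(dict(inverted["name"]))    # {'foo': [1], 'bar': [2]}
print(dict(inverted["favNum"]))  # {13: [1]}
```

Document 2 simply never shows up under favNum, so in this model the missing value costs nothing on the search side.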

Doc values are a different story, however. Doc values are a column-stride data structure used for aggregations. These contain the opposite mapping of data: document -> terms for each field.

In this data structure, nulls have to be encoded as a missing value, so sparsity causes extra overhead for doc values. The exact amount is hard to say, since Lucene has a lot of tricks to minimize the impact (alternate encoding when the majority of the segment is sparse, minimizing the number of bits needed for each value, etc). But overall, the answer here is "yes": sparsity is not free.
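
A similarly rough sketch of the column-oriented side, again in Python. This is a deliberately naive model (one slot per document plus a presence flag); Lucene's real encodings are more compact, and the `build_column` helper and the placeholder value 0 are assumptions made purely for illustration.

```python
# The two documents from the question, in segment order (doc 0, doc 1).
docs = [
    {"name": "foo", "favNum": 13},
    {"name": "bar"},  # no favNum
]

def build_column(docs, field):
    """Toy model: return (values, has_value), one slot per document."""
    values = []
    has_value = []
    for doc in docs:
        present = field in doc
        has_value.append(present)
        # A missing value still occupies a slot in the column;
        # 0 is an arbitrary placeholder chosen for this sketch.
        values.append(doc[field] if present else 0)
    return values, has_value

values, has_value = build_column(docs, "favNum")
print(values)     # [13, 0]
print(has_value)  # [True, False]
```

Every document gets an entry in the column whether or not it has the field, which is where the cost of sparsity comes from in this simplified picture.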

In general, it's best to avoid indices that have excessive sparsity. If you have many different types of data, prefer to keep them in their own indices. That'll keep Lucene happier, reduce disk usage and lead to better performance.

3ygun · July 11, 2016, 10:32pm

Exactly what I needed to know! Thanks for the great insight into Elasticsearch's data representation.
