Representation of non-existent field

If I have a mapping of:

GET /my_index/_mapping
    {
      "mappings": {
          "properties": {
             "name": { "type": "string" },
             "favNum": { "type": "integer" }
          }
        }
    }

With inputs of:

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "name": "foo", "favNum": 13 }
{ "index": { "_id": "2"              }}
{ "name":  "bar" }

Would "bar"'s favNum property take up any data/space in Elasticsearch? Or would its nonexistence be determined by comparing the document against the mapping?

In SQL it would look something like this:

name | favNum
-----|-------
foo  | 13
bar  | null

The question comes from the fact that I want to define fields 1 to n, but a specific document may only have n-1 of them filled with data. I'm aware of the Dealing with Null Values chapter, but I don't believe it answers my question.

So there are two data structures that are important here: the inverted index and doc values.

The inverted index is responsible for search. It contains a mapping of term -> documents for each field. In this data structure, null (i.e. missing) values are not stored. There is no way to represent a missing value inside an inverted index unless you index a placeholder value (which is what that Dealing with Null Values chapter talks about).

So for search, the overhead is relatively small.
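
To make that concrete, here is a minimal Python sketch of a term -> documents mapping built from the two documents in the question. It is a toy model, not Lucene's actual on-disk format, and `build_inverted_index` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict

# Toy documents mirroring the bulk request in the question.
docs = {
    "1": {"name": "foo", "favNum": 13},
    "2": {"name": "bar"},  # favNum is missing
}


def build_inverted_index(docs):
    """Map field -> term -> list of doc ids (toy model, not Lucene's format)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            if value is None:
                continue  # an explicit null yields no term, so nothing is written
            index[field][value].append(doc_id)
    return index


inverted = build_inverted_index(docs)
print(dict(inverted["favNum"]))  # only document "1" appears under favNum
```

Document "2" never shows up in the favNum postings, so in this model its missing field adds no entry at all.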

Doc values are a different story, however. Doc values are a column-stride data structure used for aggregations. These contain the opposite mapping of data: document -> terms for each field.

In this data structure, nulls have to be encoded as a missing value, so sparsity causes extra overhead for doc values. The exact amount is hard to say, since Lucene has a lot of tricks to minimize the impact (alternate encoding when the majority of the segment is sparse, minimizing the number of bits needed for each value, etc). But overall, the answer here is "yes": sparsity is not free.
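
As a rough illustration of why a column-oriented layout has to account for every document, here is a small Python sketch. It assumes a naive layout with one slot per document; Lucene's real encodings are considerably smarter, and `MISSING`, `build_doc_values` and `sparsity` are hypothetical names used only for this example.

```python
MISSING = None  # hypothetical marker for "this document has no value"

# Toy documents mirroring the bulk request in the question.
docs = {
    "1": {"name": "foo", "favNum": 13},
    "2": {"name": "bar"},  # favNum is missing
}


def build_doc_values(docs, field):
    """Return one slot per document, in doc-id order (naive toy layout)."""
    return [docs[doc_id].get(field, MISSING) for doc_id in sorted(docs)]


def sparsity(column):
    """Fraction of slots in the column that hold no value."""
    if not column:
        return 0.0
    return sum(1 for value in column if value is MISSING) / len(column)


fav_num_column = build_doc_values(docs, "favNum")
print(fav_num_column)            # [13, None]
print(sparsity(fav_num_column))  # 0.5
```

In this naive layout every document occupies a slot in the favNum column whether or not it has a value, which is the overhead that grows as the index gets sparser.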

In general, it's best to avoid indices that have excessive sparsity. If you have many different types of data, prefer to keep them in their own indices. That'll keep Lucene happier, reduce disk usage, and lead to better performance.


Exactly what I needed to know! Thanks for the great insight into Elasticsearch's data representation.
