In ES >=6.0, is sparsity for doc_values & norms still bad?

Hello,

ES 6.0 and up uses Lucene 7, where doc values and norms have switched from a random-access API to an iterator API. What this means, as I understand it, is that if a document does not have a given field that other docs in the index have, we no longer have to pay in disk space for that field in that document.
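
To make the distinction concrete, here is a toy Python model of my understanding. It is only a sketch: it does not reproduce Lucene's actual classes or on-disk format, and the names `DenseValues` and `SparseValues` are made up for illustration.

```python
from typing import Dict, Iterator, List, Tuple


class DenseValues:
    """Random-access style: one slot per document, even if the field is absent."""

    def __init__(self, max_doc: int, values: Dict[int, int]) -> None:
        # Documents without the field still occupy a slot (filled with 0 here).
        self.slots: List[int] = [values.get(doc_id, 0) for doc_id in range(max_doc)]

    def get(self, doc_id: int) -> int:
        # Any document ID can be looked up directly.
        return self.slots[doc_id]


class SparseValues:
    """Iterator style: only documents that have the field are stored."""

    def __init__(self, values: Dict[int, int]) -> None:
        # Keep (doc_id, value) pairs in increasing doc_id order.
        self.entries: List[Tuple[int, int]] = sorted(values.items())

    def __iter__(self) -> Iterator[Tuple[int, int]]:
        # Consumers walk forward over the documents that have a value.
        return iter(self.entries)


# Hypothetical index: 10 documents, only docs 3 and 7 have the field.
values = {3: 42, 7: 17}
dense = DenseValues(max_doc=10, values=values)
sparse = SparseValues(values)

print(len(dense.slots))     # slots stored in the dense model
print(len(sparse.entries))  # entries stored in the sparse model
```

In this toy model the dense structure grows with the total number of documents, while the sparse one grows only with the number of documents that have the field.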

Given the above, is the advice in ES General Recommendations, under the heading "Avoid Sparsity", still relevant? It's still there in the documentation for ES >=6.0, and says things like:

In practice, this means that if an index has M documents,
norms will require M bytes of storage per field, even for fields
that only appear in a small fraction of the documents of the index.
Although slightly more complex with doc values due to the fact that
doc values have multiple ways that they can be encoded depending
on the type of field and on the actual data that the field stores, 
the problem is very similar.
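
As a back-of-the-envelope illustration of the quoted claim, assume one byte per document per field for dense norms, as the docs state. For comparison, assume a sparse encoding pays only for the documents that actually have the field. That second figure is a lower bound of my own, since it ignores whatever bookkeeping is needed to record which documents those are. The function names and the document counts are hypothetical.

```python
def dense_norms_bytes(num_docs: int) -> int:
    """Dense encoding as described in the quoted docs: one byte per document."""
    return num_docs


def sparse_norms_bytes_lower_bound(docs_with_field: int) -> int:
    """Assumed lower bound: one byte per document that has the field,
    ignoring the cost of tracking which documents those are."""
    return docs_with_field


# Hypothetical index: 10 million documents, field present in 1% of them.
num_docs = 10_000_000
docs_with_field = 100_000

dense = dense_norms_bytes(num_docs)
sparse = sparse_norms_bytes_lower_bound(docs_with_field)

print(f"dense:  {dense} bytes per field")
print(f"sparse: at least {sparse} bytes per field")
```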

Thank you,
Jan

This blog post provides some additional detail and guidelines around the change.

Thanks for that. So, you would confirm that the specific explanation of why sparsity is bad (the one I quote above from the 6.0 docs) is now obsolete, right? If so, where can I file a request to have it removed?

I am not sure which parts are still valid, but you can open an issue against the docs repo on GitHub for a clarification or change.

Submitted https://github.com/elastic/elasticsearch/issues/30833 (against elastic/elasticsearch, not elastic/docs; the latter says "If you find an error in the documentation, you should open an issue or pull request on the repository which contains the docs. For instance, the elasticsearch docs can be found in the main elasticsearch repository.")