Sparse Documents with "store" fields set to false

Mike_Wagner · September 27, 2016, 2:04am

Is there still an impact on performance with sparse documents when store is set to false for all fields (the default)? Based on my reading it would still affect the inverted index in Lucene and also increase the doc value size in Elasticsearch (we're using 2.3).

I'm trying to determine whether to have multiple indexes versus a single index when having multiple types with different fields. I understand in practice it's best to have an index per type, but would like some more concrete evidence to satisfy my customer's curiosity.

Is there a way to measure the sparsity of documents in Elasticsearch or Lucene?

jpountz · September 27, 2016, 8:51am

Actually, stored fields (the feature that is used when you set store=true on a field) work rather well with sparsity. Sparsity is more of an issue with norms (used by analyzed string fields by default) and doc values (used by default by almost all fields except analyzed string fields). The reason is that these two data structures use a dense encoding, where documents that do not have a value use the same fixed amount of disk space as documents that do. This can be a particular problem with features that create sparsity implicitly, like types and the nested type.
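To make the dense-encoding point concrete for the single-index versus index-per-type question, here is a rough back-of-the-envelope sketch. It is a toy model, not how Lucene actually sizes norms or doc values: the function names, the fixed `bytes_per_value=8`, and the assumption that every field pays for every document in the index are all illustrative simplifications.

```python
def shared_index_cost(doc_counts_by_type, fields_by_type, bytes_per_value=8):
    """Toy estimate for ONE index holding all types.

    Assumption: with a dense encoding, every field reserves space for
    every document in the index, whether or not the document has a value.
    """
    total_docs = sum(doc_counts_by_type.values())
    all_fields = set().union(*fields_by_type.values())
    return total_docs * len(all_fields) * bytes_per_value


def per_type_index_cost(doc_counts_by_type, fields_by_type, bytes_per_value=8):
    """Toy estimate for ONE INDEX PER TYPE.

    Each field only reserves space for the documents of its own type.
    """
    return sum(
        doc_counts_by_type[t] * len(fields_by_type[t]) * bytes_per_value
        for t in doc_counts_by_type
    )


# Hypothetical example: two types with disjoint fields.
docs = {"order": 1000, "customer": 1000}
fields = {"order": {"amount", "placed_at"}, "customer": {"signup_date"}}

print(shared_index_cost(docs, fields))    # 2000 docs * 3 fields * 8 bytes = 48000
print(per_type_index_cost(docs, fields))  # 1000*2*8 + 1000*1*8 = 24000
```

The real on-disk numbers depend on the codec and on the values themselves, so treat the ratio between the two results as the interesting part, not the absolute byte counts.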

Mike_Wagner · September 27, 2016, 3:00pm

@jpountz - thanks a lot for the information!

Do you know if there is any way to measure the sparsity of fields and the effects on performance? I'm looking for some sort of threshold that should lead us to move to multiple indexes. I took the Elasticsearch developer training a few weeks ago and the instructors mentioned 1700 fields crashing Lucene indexes, but said anything over 1000 was bad. I was hoping for some more concrete documentation or metrics on which to base our decision.

jpountz · September 27, 2016, 3:58pm

As usual it is hard to give any hard numbers. You can measure the sparsity by comparing the result of an exists query on a field with the total number of docs in the index. Sparsity is mostly going to affect indexing performance and disk space: both will essentially behave as if all documents had a value. Search performance should be OK as long as the sparsity does not make disk usage so high that the filesystem cache cannot do a good job anymore.
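The exists-versus-total comparison can be sketched in a few lines of Python. This is only a minimal sketch using the standard library and the `_count` API with an `exists` query; the helper names, the base URL, and the index and field names in the usage comment are placeholders, not anything from this thread.

```python
import json
import urllib.request


def exists_count_body(field):
    """Body for the _count API that matches docs having a value for `field`."""
    return {"query": {"exists": {"field": field}}}


def count_docs(base_url, index, body=None):
    """POST <base_url>/<index>/_count and return the 'count' from the response.

    With no body, counts all documents via match_all.
    """
    payload = json.dumps(body or {"query": {"match_all": {}}}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_count",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["count"]


def sparsity(docs_with_value, total_docs):
    """Fraction of documents WITHOUT a value (0.0 means fully dense)."""
    if total_docs == 0:
        return 0.0
    return 1.0 - docs_with_value / total_docs


# Usage against a running cluster (placeholder URL, index and field):
# total = count_docs("http://localhost:9200", "my_index")
# with_value = count_docs("http://localhost:9200", "my_index",
#                         exists_count_body("my_field"))
# print(sparsity(with_value, total))
```

For example, if 250 out of 1000 documents have the field, `sparsity(250, 1000)` gives 0.75, meaning three quarters of the documents pay for a value they do not have.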
