Is there still an impact on performance with sparse documents when store is set to false for all fields (the default)? Based on my reading it would still affect the inverted index in Lucene and also increase the doc value size in Elasticsearch (we're using 2.3).
I'm trying to determine whether to have multiple indexes versus a single index when having multiple types with different fields. I understand in practice it's best to have an index per type, but would like some more concrete evidence to satisfy my customer's curiosity.
Is there a way to measure the sparsity of documents in Elasticsearch or Lucene?
Actually, stored fields (the feature that is used when you set store=true on a field) work rather well with sparsity. Sparsity is more of an issue with norms (used by default by analyzed string fields) and doc values (used by default by almost all fields except analyzed string fields). The reason is that these two data structures use a dense encoding, where documents that do not have a value use the same fixed amount of disk space as documents that do. This can be a particular issue with features that create sparsity implicitly, like types and the nested type.
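To make the dense-encoding point concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model: every field costs a fixed number of bytes for every document in the index, and each type has exactly one field of its own. The 8 bytes per value, the type names, and the function names are illustrative assumptions, not measurements of real Lucene files (actual doc values are compressed).

```python
def dense_bytes(total_docs, bytes_per_value):
    """Dense encoding: every doc in the index pays for the field, value or not."""
    return total_docs * bytes_per_value


def shared_vs_separate(docs_per_type, bytes_per_value=8):
    """Compare one shared index against one index per type.

    Simplified model: each type has exactly one field unique to it.
    bytes_per_value is an illustrative assumption, not a Lucene constant.
    """
    total_docs = sum(docs_per_type.values())
    # Shared index: each type's field is paid for by ALL docs in the index.
    shared = sum(dense_bytes(total_docs, bytes_per_value) for _ in docs_per_type)
    # Separate indexes: each field is only paid for by docs of its own type.
    separate = sum(dense_bytes(n, bytes_per_value) for n in docs_per_type.values())
    return shared, separate


if __name__ == "__main__":
    # Hypothetical type sizes, chosen only for illustration.
    shared, separate = shared_vs_separate({"user": 1_000_000, "order": 9_000_000})
    print("shared index:    ", shared, "bytes")
    print("separate indexes:", separate, "bytes")
```

Under this model the shared index costs twice as much for those two fields, and the gap grows with the number of types and type-specific fields.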
Do you know if there is any way to measure the sparsity of fields and its effects on performance? I'm looking for some sort of threshold that should lead us to move to multiple indexes. I took the Elasticsearch developer training a few weeks ago, and the instructors mentioned 1700 fields crashing Lucene indexes, and said that anything over 1000 was bad. I was hoping for some more concrete documentation or metrics on which to base our decision.
As usual, it is hard to give any hard numbers. You can measure the sparsity of a field by comparing the number of hits for an exists query on that field with the total number of docs in the index. Sparsity is mostly going to affect indexing performance and disk space: both will essentially behave as if all documents had a value. Search performance should be ok as long as the sparsity does not make disk usage so high that the filesystem cache cannot do a good job anymore.
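The exists-versus-total comparison could be scripted roughly like this. It is a minimal sketch using only the Python standard library against the `_count` endpoint with an `exists` query; the base URL, index name, field name, and helper function names are placeholders, and the exact headers and response shape may need adjusting for your cluster version.

```python
import json
from urllib.request import Request, urlopen


def exists_count_body(field):
    """Request body that matches only docs having a value for `field`."""
    return {"query": {"exists": {"field": field}}}


def sparsity(docs_with_value, total_docs):
    """Fraction of docs WITHOUT a value: 0.0 is fully dense, 1.0 is empty."""
    if total_docs == 0:
        return 0.0
    return 1.0 - docs_with_value / total_docs


def count(base_url, index, body=None):
    """Call the _count API; with no body it counts every doc in the index."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = Request(
        f"{base_url}/{index}/_count",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["count"]


def field_sparsity(base_url, index, field):
    """Compare docs that have `field` against all docs in the index."""
    total = count(base_url, index)
    with_value = count(base_url, index, exists_count_body(field))
    return sparsity(with_value, total)


if __name__ == "__main__":
    # Placeholder host, index, and field names; replace with your own.
    print(field_sparsity("http://localhost:9200", "my_index", "my_field"))
```

A result close to 1.0 means almost no documents carry the field, so under dense encoding most of the space it uses is spent on documents without a value.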