Why is _source filtering faster than stored fields retrieval?

In numerous ES posts, it is stated that retrieving stored fields is slower than filtering values from _source, the reason given being that reading each field requires a disk access (see "disk seeks and stored fields" - Clinton's reply).
AFAIK all fields (including _source) are stored consecutively in the .fdt file. So why should each additional field after the first one require an extra disk access?
In my data, removing _source and storing each field instead saves 24% of disk space. My store will hold about 500 billion one-line documents, so saving 24% is serious business.

I do not need highlighting, updates, re-indexing, or other features that depend on _source.
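For concreteness, the setup I'm describing looks roughly like this (a minimal sketch; the index, type, and field names are made up):

```
PUT my_index
{
  "mappings": {
    "line": {
      "_source": { "enabled": false },
      "properties": {
        "message":   { "type": "string",  "store": true },
        "count":     { "type": "integer", "store": true },
        "timestamp": { "type": "date",    "store": true }
      }
    }
  }
}
```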

I'm not an expert on this but I've picked @jpountz's brain, and this is what he has to say:

  • The comment about a separate disk seek is out of date and no longer applies.
  • However, there are three fetch phases (fetch source, fetch stored fields, highlight), all of which require access to stored fields. Each phase fetches the stored fields (from the FS cache, so no disk seek) but then has to decompress them; this cost is minimised by using just the _source.
  • The _source is stored as a single binary blob (which Lucene compresses with LZ4 or deflate), but it can't take advantage of the specialised compression for numerics (which probably dominate your data?).

So long story short: if you don't need highlighting, and you don't need the _source for retrieval or reindexing, then yes you can probably rely just on stored fields.
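To make the difference concrete, here is a sketch of the two retrieval styles (names are made up, and the stored-fields example assumes the fields were mapped with "store": true):

```
# _source filtering: one compressed blob is read, decompressed, then filtered
GET my_index/_search
{
  "query": { "match_all": {} },
  "_source": [ "message", "count" ]
}

# stored fields: each requested field is read as its own stored field
# (the parameter is `fields` in 2.x; it was renamed `stored_fields` in 5.0)
GET my_index/_search
{
  "query": { "match_all": {} },
  "fields": [ "message", "count" ]
}
```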

Three more things:

  • Be really careful before deciding you don't need to reindex... If you want to keep your data around for two major versions, you will have to reindex, which means you must have the _source. Without it, say bye-bye.
  • Try using index.codec: best_compression (which enables deflate instead of LZ4). It may not get you the full 24% of savings but it'll get you closer. You can keep the default LZ4 codec on your active index and then, when it becomes inactive, change to best_compression and force-merge (see the sketch after this list).
  • There is an open issue for changing the way that the _source is stored (basically storing each top-level field as a separate stored field), which may allow many use cases to take advantage of the numeric compression. See https://github.com/elastic/elasticsearch/issues/9034
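As a sketch of that codec tip (index name made up; index.codec is a static setting, so the index must be closed while you change it):

```
# index.codec is a static setting: close the index, change it, reopen
POST my_index/_close

PUT my_index/_settings
{
  "index.codec": "best_compression"
}

POST my_index/_open

# rewrite the existing segments so they are stored with the new codec
POST my_index/_forcemerge?max_num_segments=1
```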

Thanks for a complete and swift answer. This forum is fantastic.


I tested the issue with a database of 100 million lines. The mapping contained:

  • 12 string fields
  • 3 integer fields
  • 1 date field

ES 2.3; doc values are disabled, as I do not need aggregations.
I built two indices from the same 100-million-line source file:

  1. standard - with _source and no stored fields - size 1.88 GB
  2. all fields stored, no _source field - size 1.31 GB (30% less)
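For reference, the second setup looked roughly like this in 2.3 (field names are made up and only 3 of the 16 fields are shown):

```
PUT test_stored
{
  "mappings": {
    "line": {
      "_source": { "enabled": false },
      "properties": {
        "field_01": { "type": "string",  "index": "not_analyzed",
                      "store": true, "doc_values": false },
        "field_13": { "type": "integer", "store": true, "doc_values": false },
        "field_16": { "type": "date",    "store": true, "doc_values": false }
      }
    }
  }
}
```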

Indexing time for the index without _source was 15% lower than for the standard one.

I tried large scan queries on both indices, reading up to 10 million results with all the fields. Both performed about the same in those tests.
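A scan-style read in 2.3 looks roughly like this (page size and field names are made up; sorting on _doc is the replacement for the deprecated search_type=scan):

```
# first page: sort on _doc for cheap scan-style iteration
GET test_stored/_search?scroll=1m
{
  "size": 10000,
  "sort": [ "_doc" ],
  "fields": [ "field_01", "field_13", "field_16" ]
}

# keep pulling pages with the scroll_id returned by each response
POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "..."
}
```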

So, in my situation, I guess going without _source is the right thing for me.

And again - thanks for the prompt reply.