What's the memory impact of using _source? If my ES node has a shard loaded in RAM, does that shard pull into RAM all _source fields it holds? Or will the _source field be kept on disk until the get occurs?
How does the inverted index relate to the _source field? Non-issue?
Since _source appears to be stored as JSON, how would binary document formats be handled when storing _source? Converted to JSON? Stored as some sort of blob in the JSON body (encoded or some other way)?
Just curious here, as I'm considering the impact of making this change across all of the customer use cases I have, from document-heavy (PDF, Word, Excel, etc.) to memory-constrained.
The _source of multiple documents is chunked together and stored compressed. When you want to load the _source for a document, the chunk has to be streamed through memory until we reach the document that you want. For a search, by default, we do this only for the hits returned. All the bytes of the _source for those hits are kept in memory and returned in the response. If you want, you can disable this on search but still store the _source in case you need it later. That is fine.
Sometimes we have to do more than just shuffle the bits around, like during highlighting and source filtering. But not always.
We support storing JSON, YAML, CBOR and SMILE. Personally I'd usually skip putting binary data into Elasticsearch.
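To picture the "chunked together and stored compressed" part, here is a rough sketch in Python. It is not how Lucene actually lays out stored fields; the length-prefixed layout, the use of zlib, and the names build_chunk and load_source are all assumptions made for illustration. The point is only that fetching one document means decompressing its chunk and walking past the documents in front of it.

```python
import json
import zlib


def build_chunk(docs):
    """Pack several documents into one compressed chunk (illustrative layout only)."""
    buf = bytearray()
    for doc in docs:
        raw = json.dumps(doc).encode("utf-8")
        # 4-byte big-endian length prefix, then the JSON bytes
        buf += len(raw).to_bytes(4, "big") + raw
    return zlib.compress(bytes(buf))


def load_source(chunk, index):
    """Decompress the chunk and skip forward until we reach document `index`."""
    data = zlib.decompress(chunk)
    pos = 0
    for i in range(index + 1):
        if pos + 4 > len(data):
            break  # ran past the end of the chunk
        size = int.from_bytes(data[pos:pos + 4], "big")
        pos += 4
        if i == index:
            return json.loads(data[pos:pos + size])
        pos += size  # not the one we want, skip over it
    raise IndexError("no document at that position in the chunk")


docs = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}, {"id": 3, "title": "c"}]
chunk = build_chunk(docs)
print(load_source(chunk, 1))
```

Nothing here stays resident per document: only the chunk being read passes through memory, and only for the documents you actually ask for.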
How would I disable returning _source on search as you mention? I probably want to test both ways but I imagine turning that off behind a flag would be most useful for some of my memory constrained environments.
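One way to do this is to set "_source": false in the search request body, which asks Elasticsearch not to return the stored source for the hits. A minimal sketch of putting that behind a flag is below; the helper name build_search_body, the return_source parameter, and the "title" field are hypothetical, and the resulting dict is meant to be sent as the body of a _search request by whatever client you use.

```python
import json


def build_search_body(query, return_source=True):
    """Build a search request body, optionally suppressing _source in the hits.

    `return_source` is a hypothetical application-level flag.
    """
    body = {"query": query}
    if not return_source:
        # Ask Elasticsearch to omit _source from each hit
        body["_source"] = False
    return body


# Example: a memory-constrained deployment turns the flag off
body = build_search_body({"match": {"title": "memory"}}, return_source=False)
print(json.dumps(body))
```

The _source is still stored in the index, so a later get by ID can still fetch it when it is needed.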