My question is simply: is this data stored efficiently in Elasticsearch, for example by keeping a reference in each row to the original data, so that instead of taking up around 250 bytes on disk it only uses a reference of a couple of bytes?
It's stored fairly compactly. There are three aspects here, for a set of fields with default-ish parameters:
Size of the original JSON document (which is returned to the user in the _source field). This is stored as a compressed binary blob using LZ4 or DEFLATE.
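For what it's worth, the codec used for that stored _source blob can be chosen per index: the default favours LZ4 for speed, while best_compression switches to DEFLATE for a smaller footprint at some indexing/fetch cost. A minimal sketch with the Python client (8.x-style API assumed; the cluster URL and index name are placeholders):

```python
# Sketch: pick the stored-fields codec at index creation time.
# "default" -> LZ4 (faster); "best_compression" -> DEFLATE (smaller on disk).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster

es.indices.create(
    index="events-demo",                       # hypothetical index name
    settings={"index": {"codec": "best_compression"}},
)
```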
Size of the inverted index for each individual searchable field. Inverted indices are usually very compact. A term dictionary is created which maps all the values of the field to numeric ordinals, and those ordinals are then used to record which docs contain which tokens. Numeric ordinals compress very well and are much smaller than the original byte strings. The term dictionary itself is compressed into an FST, which is also very compact due to shared prefix compression.
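To make the ordinal idea concrete, here's a toy sketch in plain Python (nothing like Lucene's real implementation, just the shape of the idea): each distinct term is stored once in the term dictionary, and the postings then refer to terms by small integers instead of repeating the original strings.

```python
# Toy inverted index: the term dictionary maps each distinct term to an ordinal,
# and the postings map each ordinal to the doc IDs that contain it.
docs = {
    0: ["error", "disk", "full"],
    1: ["error", "timeout"],
    2: ["disk", "timeout"],
}

term_dict = {}   # term -> ordinal (Lucene compresses this mapping into an FST)
postings = {}    # ordinal -> list of doc IDs containing that term

for doc_id, tokens in docs.items():
    for tok in tokens:
        ordinal = term_dict.setdefault(tok, len(term_dict))
        postings.setdefault(ordinal, []).append(doc_id)

print(term_dict)   # {'error': 0, 'disk': 1, 'full': 2, 'timeout': 3}
print(postings)    # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

The per-document cost of a repeated term ends up being a small integer reference rather than the full string, and Lucene compresses those doc ID lists further on disk.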
Size of doc values for each aggregatable field. Doc values are columnar data structures which are used for aggregating, and they also compress very well (although with different characteristics from the inverted index). Numerics use all kinds of tricks like GCD and delta encoding, compressed blocks, etc. Strings are converted to ordinals, much like the term dictionary, and those ordinals are then compressed like numeric fields.
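The numeric tricks are easier to see with a toy example than to describe. A sketch (again plain Python, just the idea, not Lucene's actual block-based layout): divide out the common divisor, then store deltas, and the per-document numbers that remain are tiny.

```python
# Toy GCD + delta encoding for a numeric doc-values column.
from functools import reduce
from math import gcd

values = [1000, 3000, 2000, 6000, 5000]   # hypothetical per-doc numeric field

common = reduce(gcd, values)               # 1000
scaled = [v // common for v in values]     # [1, 3, 2, 6, 5]
deltas = [scaled[0]] + [b - a for a, b in zip(scaled, scaled[1:])]  # [1, 2, -1, 4, -1]

# Decoding reverses the steps exactly.
restored, acc = [], 0
for d in deltas:
    acc += d
    restored.append(acc * common)
assert restored == values
```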
So the general answer is: yes, compression is pretty good. The amount of compression depends on which fields are present, their cardinality, and which options are enabled or disabled... but it isn't just naively saving the same value over and over for every document.
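If you want to check this on your own data rather than take it on faith, recent Elasticsearch versions have an analyze-index-disk-usage API that breaks disk usage down per field and per structure (stored fields, inverted index, doc values, ...). A sketch, assuming an 8.x-style Python client and a placeholder index name:

```python
# Sketch: ask Elasticsearch where the bytes go, per field and per structure.
# Requires a reasonably recent cluster; the exact response shape may vary.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster

usage = es.indices.disk_usage(index="events-demo", run_expensive_tasks=True)
print(usage)   # per-field breakdown: inverted_index, stored_fields, doc_values, ...
```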
Ok, thanks for the detailed explanation. I figured that was the case, but I just wanted to make sure since I'm new to this type of database. Calling them documents is confusing for me because of all the historical baggage that goes along with that term; I still relate to them as rows with fields.
In terms of relational databases, I guess I'm looking for something like normalization, where a single value is stored as a single row in one table and a related table refers to that value by a 1-4 byte integer. So you only pay N bytes for the value plus at most 4 bytes per row. It sounds like this is done, loosely, by what you describe in aspect #2 above, by having a term dictionary, correct?
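To put my mental model in code, this is the kind of normalization I mean (a toy sketch, nothing Elasticsearch-specific; the names are made up):

```python
# Toy "normalization": store each distinct string once in a lookup list and
# keep only a small integer per row, which is roughly what ordinals buy you.
rows = ["pending", "shipped", "pending", "delivered", "shipped", "pending"]

dictionary = []   # distinct values, stored once (the lookup table)
ordinals = []     # one small integer per row (the foreign key)
seen = {}
for value in rows:
    if value not in seen:
        seen[value] = len(dictionary)
        dictionary.append(value)
    ordinals.append(seen[value])

print(dictionary)   # ['pending', 'shipped', 'delivered']
print(ordinals)     # [0, 1, 0, 2, 1, 0]
```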
BTW, for other people's future reference, FST here stands for finite state transducer, the data structure Lucene uses for the term dictionary. It took some effort to find this, as it's not one of the more commonly found meanings of the acronym.