Short field type taking 4 bytes instead of 2

yfful · July 3, 2020, 1:11pm

Hello,

Looking at the short field type, I see that the Fields created are based on the integer implementation:

elastic/elasticsearch/blob/4366360895dbcd28bf993000b80c95f83ecb79a5/server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java#L573


    @Override
    public Query rangeQuery(String field, Object lowerTerm, Object upperTerm,
                            boolean includeLower, boolean includeUpper,
                            boolean hasDocValues, QueryShardContext context) {
        return INTEGER.rangeQuery(field, lowerTerm, upperTerm, includeLower, includeUpper, hasDocValues, context);
    }

    @Override
    public List<Field> createFields(String name, Number value,
                                    boolean indexed, boolean docValued, boolean stored) {
        return INTEGER.createFields(name, value, indexed, docValued, stored);
    }

    @Override
    Number valueForSearch(Number value) {
        return value.shortValue();
    }
},
INTEGER("integer", NumericType.INT) {
    @Override
    public Integer parse(Object value, boolean coerce) {

However, this means that an IntPoint is used when a short field is indexed. As far as I understand, the point for a short value therefore uses 4 bytes per dimension (with only 1 dimension):

apache/lucene-solr/blob/05324e7b1813c43084fbce7f3e6305db0ac94c32/lucene/core/src/java/org/apache/lucene/document/IntPoint.java#L49


*   <li>{@link #newExactQuery(String, int)} for matching an exact 1D point.
*   <li>{@link #newSetQuery(String, int...)} for matching a set of 1D values.
*   <li>{@link #newRangeQuery(String, int, int)} for matching a 1D range.
*   <li>{@link #newRangeQuery(String, int[], int[])} for matching points/ranges in n-dimensional space.
* </ul>
* @see PointValues
*/
public final class IntPoint extends Field {
 private static FieldType getType(int numDims) {
   FieldType type = new FieldType();
   type.setDimensions(numDims, Integer.BYTES);
   type.freeze();
   return type;
 }

 @Override
 public void setIntValue(int value) {
   setIntValues(value);
 }

 /** Change the values of this field */

There is probably something I am missing, but does that mean that a short field indexes values using 4 bytes per dimension rather than 2? Can you shed some light on this?
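To make the question concrete, here is a minimal stdlib-only sketch of what a fixed 4-byte point encoding looks like for a value that started out as a short. It assumes a sign-bit flip followed by big-endian layout, which is how I understand Lucene's sortable int encoding to work; the class and method names are hypothetical and not part of Lucene or Elasticsearch.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: encodes an int as 4 sortable bytes.
final class PointEncodingSketch {

    // Assumption: sign bit is flipped so that unsigned byte-wise order
    // matches signed int order, then the int is written big-endian.
    static byte[] encodeIntPoint(int value) {
        int sortable = value ^ 0x80000000;
        byte[] out = new byte[Integer.BYTES]; // always 4 bytes per dimension
        out[0] = (byte) (sortable >>> 24);
        out[1] = (byte) (sortable >>> 16);
        out[2] = (byte) (sortable >>> 8);
        out[3] = (byte) sortable;
        return out;
    }

    public static void main(String[] args) {
        short s = 1234;
        // The short is widened to int before encoding, so it occupies
        // Integer.BYTES bytes rather than Short.BYTES.
        byte[] encoded = encodeIntPoint(s);
        System.out.println(encoded.length + " bytes: " + Arrays.toString(encoded));
    }
}
```

Under these assumptions, the two high-order bytes carry no information for values in the short range, which is where the question about 4 versus 2 bytes comes from.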

Thanks!

ywelsch · July 3, 2020, 1:34pm

The docs state:

As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

yfful · July 3, 2020, 1:42pm

Thanks for pointing this out. I had read it but hadn't thought it through.

Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

OK, so this means that how a field is stored on disk is handled internally, regardless of the actual type.

This will help indexing and searching be more efficient.

What is the benefit of using short over integer at index/search time if a short is represented internally as an integer (from what I've understood of the code)?

ywelsch · July 3, 2020, 2:06pm

Correct. See Lucene's packed package: org.apache.lucene.util.packed (Lucene 8.5.2 API)

I'm not exactly sure why the docs are worded that way. The main benefit I see is that the document's data is validated at index time to fit into the given value range (byte, short, ...), which helps the underlying "packing" techniques work well when the range of actual values is limited (i.e. no outliers).
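As a rough illustration of why the range of actual values matters more than the declared type, the following sketch computes how many bits per value a simple offset-from-minimum packing would need. It is a simplification under my own assumptions and does not call or reproduce Lucene's packed package; the class and method names are hypothetical.

```java
// Hypothetical sketch of range-based bit packing, not Lucene's implementation.
final class PackingSketch {

    // Bits needed to store any value in [min, max] as an offset from min.
    // Assumes max >= min and that max - min does not overflow a long.
    static int bitsRequired(long min, long max) {
        long delta = max - min;
        return Math.max(1, 64 - Long.numberOfLeadingZeros(delta));
    }

    public static void main(String[] args) {
        // Same values cost the same number of bits whatever the mapping says.
        System.out.println(bitsRequired(0, 1000));                          // 10
        // The full short range needs 16 bits.
        System.out.println(bitsRequired(Short.MIN_VALUE, Short.MAX_VALUE)); // 16
        // A single outlier that a short mapping would have rejected
        // pushes the cost for the whole block up.
        System.out.println(bitsRequired(0, 2_000_000_000L));                // 31
    }
}
```

In this simplified model, validating values against the short range caps the cost at 16 bits per value, while an integer mapping would allow an outlier to raise it.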

@jpountz might have more insights here.

yfful · July 3, 2020, 3:31pm

@ywelsch Thanks for the reference for the packing, that's interesting.

In my original post I mistakenly omitted a link to BKDWriter, where the bytesPerDim field gets its value from a FieldInfo. I believe that value ultimately comes from IntPoint#getType(), which I referenced earlier. From what I can tell, bytesPerDim affects the size of the byte arrays allocated in BKDWriter.

apache/lucene-solr/blob/05324e7b1813c43084fbce7f3e6305db0ac94c32/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java#L165


public BKDWriter(int maxDoc, Directory tempDir, String tempFileNamePrefix, int numDataDims, int numIndexDims, int bytesPerDim,
                    int maxPointsInLeafNode, double maxMBSortInHeap, long totalPointCount) throws IOException {
  verifyParams(numDataDims, numIndexDims, maxPointsInLeafNode, maxMBSortInHeap, totalPointCount);
  // We use tracking dir to deal with removing files on exception, so each place that
  // creates temp files doesn't need crazy try/finally/sucess logic:
  this.tempDir = new TrackingDirectoryWrapper(tempDir);
  this.tempFileNamePrefix = tempFileNamePrefix;
  this.maxPointsInLeafNode = maxPointsInLeafNode;
  this.numDataDims = numDataDims;
  this.numIndexDims = numIndexDims;
  this.bytesPerDim = bytesPerDim;
  this.totalPointCount = totalPointCount;
  this.maxDoc = maxDoc;
  docsSeen = new FixedBitSet(maxDoc);
  packedBytesLength = numDataDims * bytesPerDim;
  packedIndexBytesLength = numIndexDims * bytesPerDim;

  scratchDiff = new byte[bytesPerDim];
  scratch1 = new byte[packedBytesLength];
  scratch2 = new byte[packedBytesLength];
  commonPrefixLengths = new int[numDataDims];
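The arithmetic in the constructor above can be sketched on its own to compare bytesPerDim = 4 with a hypothetical bytesPerDim = 2. This is only a back-of-the-envelope estimate: the class and method names are made up, and the raw-bytes figure ignores doc IDs and any other per-point overhead.

```java
// Hypothetical sizing sketch based on the packedBytesLength formula quoted above.
final class BkdSizingSketch {

    // Same formula as in the quoted constructor: numDataDims * bytesPerDim.
    static int packedBytesLength(int numDataDims, int bytesPerDim) {
        return numDataDims * bytesPerDim;
    }

    // Raw bytes for the point values alone (assumption: no other overhead counted).
    static long rawPointBytes(long numPoints, int numDataDims, int bytesPerDim) {
        return numPoints * packedBytesLength(numDataDims, bytesPerDim);
    }

    public static void main(String[] args) {
        long points = 10_000_000L;
        // Current behaviour: short fields go through IntPoint, so 4 bytes per dimension.
        System.out.println(rawPointBytes(points, 1, Integer.BYTES)); // 40000000
        // Hypothetical native short point with 2 bytes per dimension.
        System.out.println(rawPointBytes(points, 1, Short.BYTES));   // 20000000
    }
}
```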

jpountz · July 6, 2020, 8:04am

@ywelsch is right, the only benefit of short over integer is input validation.
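A minimal sketch of what that validation amounts to, assuming the short type simply range-checks the input and then hands the value on as an int. The names and the error message are hypothetical, and the sketch ignores the coerce flag visible in the parse signature quoted earlier in the thread.

```java
// Hypothetical sketch of "validate as short, index as int"; not Elasticsearch code.
final class ShortValidationSketch {

    // Rejects values outside the short range, then widens to int.
    // Assumption: fractional input is simply truncated here.
    static int parseShortAsInt(Number value) {
        double d = value.doubleValue();
        if (d < Short.MIN_VALUE || d > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Value [" + value + "] is out of range for a short");
        }
        return value.shortValue(); // implicit widening from short to int
    }

    public static void main(String[] args) {
        System.out.println(parseShortAsInt(123)); // 123
        try {
            parseShortAsInt(40_000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```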

Having native support for shorts would not help reduce disk space or search-time memory usage. The main benefit is that it would help save some memory in the IndexWriter buffer and thus create new segments a bit less frequently.

I rarely see shorts in mappings, so given that there are only tiny benefits over integer, I think it's the right trade-off to implement them as integers under the hood. I understand why the documentation might sound surprising once you're familiar with this implementation detail, but in my opinion documenting it would be even more confusing, so I don't mind it the way it is.

system · August 3, 2020, 8:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
