Short field type taking 4 bytes instead of 2

Hello,

Looking at the short field type, I see that the Fields created are based on the integer implementation:

However, this means that an IntPoint is used when a short field is indexed. As a result, the short point uses 4 bytes per dimension (with only 1 dimension), from what I understand:

There is probably something I am missing, but does that mean that a short field indexes values using 4 bytes per dimension rather than 2? Can you shed some light on this?
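To make the question concrete, here is a minimal sketch of the two encodings being compared. It does not use Lucene itself: `encodeAsInt` only mimics the sortable 4-byte layout that, as far as I understand, IntPoint produces for one dimension (sign bit flipped, big-endian), and `encodeAsShort` is a purely hypothetical 2-byte variant that does not exist in the code I looked at. The class and method names are my own.

```java
public class PointWidthSketch {

    // Assumed layout of an int point: flip the sign bit, then write 4 big-endian bytes,
    // so that comparing the bytes as unsigned values matches numeric order.
    static byte[] encodeAsInt(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Hypothetical 2-byte equivalent for a short (illustration only, not an existing API).
    static byte[] encodeAsShort(short value) {
        int flipped = (value ^ 0x8000) & 0xFFFF;
        return new byte[] { (byte) (flipped >>> 8), (byte) flipped };
    }

    public static void main(String[] args) {
        short v = 1234;
        // A short value widened to int still occupies 4 bytes in the int layout.
        System.out.println(encodeAsInt(v).length);   // 4
        System.out.println(encodeAsShort(v).length); // 2
    }
}
```

In other words, the value range of a short would fit in half the bytes per dimension, which is what prompted the question.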

Thanks!

The docs state:

As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

Thanks for pointing this out. I've read it but didn't reflect on it.

Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

OK, so this means that how a field is stored on disk is handled internally, regardless of the actual type.

This will help indexing and searching be more efficient.

What is the benefit of using short over integer at index/search time if a short is represented internally as an integer (from what I've understood of the code)?

Correct. See Lucene's packed package: org.apache.lucene.util.packed (Lucene 8.5.2 API)

I'm not exactly sure why the docs are stated that way. The main benefit I see is that the document's data is validated at index time to fit into the given value range (byte, short, ...), which helps the underlying "packing" techniques to work well if the range of actual values is limited (i.e. no outliers).
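As a rough illustration of that point, the sketch below shows why the bits needed per value depend on the actual values rather than on the declared type, and how a range check keeps outliers out. It is plain Java written for this reply, not Lucene code: `bitsRequired`, `bitsPerValue`, and `checkShort` are hypothetical helpers, and subtracting the minimum before packing is a simplifying assumption about how packing might be applied.

```java
public class PackingSketch {

    // Bits needed to represent every value in [0, maxValue].
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative, got " + maxValue);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Bits per value if each value is stored as an offset from the minimum.
    static int bitsPerValue(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return bitsRequired(max - min);
    }

    // The kind of range validation a short field performs at index time.
    static long checkShort(long value) {
        if (value < Short.MIN_VALUE || value > Short.MAX_VALUE) {
            throw new IllegalArgumentException("Value [" + value + "] is out of range for a short");
        }
        return value;
    }

    public static void main(String[] args) {
        // Tightly clustered values pack into few bits, whatever the declared type.
        System.out.println(bitsPerValue(new long[] {100, 105, 110, 120}));                 // 5
        // One outlier that an integer field would accept inflates the width for all values.
        System.out.println(bitsPerValue(new long[] {100, 105, 110, 120, 2_000_000_000L})); // 31
        // With short validation, the worst accepted outlier is much smaller.
        System.out.println(bitsPerValue(new long[] {100, 105, 110, 120, checkShort(32767)})); // 15
    }
}
```

The declared type never appears in the packing calculation; it only bounds how bad an outlier can be.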

@jpountz might have more insights here.

@ywelsch Thanks for the reference for the packing, that's interesting.

I mistakenly left out of my original post a link to BKDWriter, where the bytesPerDim field gets its value from a FieldInfo. I believe that value ultimately comes from IntPoint#getType(), which I referenced earlier. From what I can tell, bytesPerDim affects the size of the byte arrays allocated in BKDWriter.

@ywelsch is right: the only benefit of short over integer is input validation.

Having native support for shorts would not help reduce disk space or search-time memory usage. The main benefit is that it would help save some memory in the IndexWriter buffer and thus create new segments a bit less frequently.

I rarely see shorts in mappings, so given that there are only tiny benefits over integer, I think it's the right trade-off to implement them as integers under the hood. I understand why the documentation might sound surprising once you're familiar with this implementation detail, but in my opinion documenting it would be even more confusing, so I'm fine with it the way it is.
