I get the impression that using the 'long' type instead of 'integer' would
use more disk space and degrade search performance (similarly for double
instead of float), but there's nothing in the documentation to back this
impression up.
There must be an advantage to using integer (if you can) because otherwise
it wouldn't exist. The documentation just doesn't say what the advantage is.
Can someone confirm? Even better, does anyone have any stats on what
difference it would make?
Lucene only knows how to index text strings. Numeric types are stored as
tries, and tries work on variable-length keys, so the only difference
between integer and long is the API that converts the value into trie
terms. Tries are the basis for numeric range searches.
It is a myth that longs take more disk space than ints in an inverted index
like Lucene. Both long and integer (numeric types) take a bit more space
than text strings, but for large indices this does not add up to anything;
it is in the noise.
For field caches/filters and doc values, the difference between integer and
long matters more. But other aspects, such as field cardinality, also
determine the overall storage volume required.
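As a rough illustration of the trie idea described above, here is a minimal
Java sketch that derives one prefix term per precision level by dropping
low-order bits. It is a simplification and not Lucene's actual encoding: the
class name, the `bits` and `precisionStep` parameters, and the 8-bit step are
assumptions made for this example, and it only handles non-negative values.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of prefix (trie) terms; not Lucene's real encoding. */
final class TrieTermsSketch {

    /**
     * Returns one prefix per precision level for a non-negative value.
     * bits and precisionStep are hypothetical parameters for this sketch.
     */
    static List<Long> prefixTerms(long value, int bits, int precisionStep) {
        List<Long> terms = new ArrayList<>();
        for (int shift = 0; shift < bits; shift += precisionStep) {
            // Drop the lowest `shift` bits; coarser prefixes cover wider ranges.
            terms.add(value >>> shift);
        }
        return terms;
    }

    public static void main(String[] args) {
        // The same value treated as a 32-bit and as a 64-bit number.
        System.out.println(prefixTerms(1234L, 32, 8));
        System.out.println(prefixTerms(1234L, 64, 8));
    }
}
```

In this sketch the 64-bit case produces more prefixes than the 32-bit case,
but for any value that fits in 32 bits the extra high-order prefixes are all
zero, so every document contributes the same extra terms.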
So let's assume the cardinality is the same. Let's assume I have no text
and I only index numeric fields.
If I've got a range of data that would all fit within the bounds of an
integer, is there any reason not to index it as a long? Are there any
downsides? It sounds like you're saying that there aren't?
(Replying to Jörg Prante's message of Friday, October 24, 2014 2:55:24 PM
UTC+1, itself a reply to Tim S's post of Fri, Oct 24, 2014 at 3:22 PM.)
It depends on what you do with the ints. Your question was about disk
storage. Ints are much faster once they are loaded into a cache: they need
50% less memory than longs, they can be used as array indices for sorting,
loading or storing one by CPU instruction takes only one cycle, etc.
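To put a number on the 50% figure, here is a back-of-the-envelope Java
sketch. It assumes a cache that holds exactly one primitive per document in
a plain array and ignores the array header; the class name and the document
count are made up for illustration, and real field caches or doc values may
pack or compress values differently.

```java
/** Rough payload sizes for a cache holding one primitive per document. */
final class CacheFootprintSketch {

    /** Payload bytes of an int[] with one entry per document. */
    static long intCacheBytes(int docCount) {
        return (long) docCount * Integer.BYTES; // 4 bytes per value
    }

    /** Payload bytes of a long[] with one entry per document. */
    static long longCacheBytes(int docCount) {
        return (long) docCount * Long.BYTES; // 8 bytes per value
    }

    public static void main(String[] args) {
        int docCount = 10_000_000; // hypothetical index size
        System.out.println("int cache:  " + intCacheBytes(docCount) + " bytes");
        System.out.println("long cache: " + longCacheBytes(docCount) + " bytes");
    }
}
```

Under these assumptions the long array needs twice the memory of the int
array for the same number of documents, whatever the stored values are.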