So I have a test index where I define field1 as type: long. I then
define field2 as type: integer. I index 2 documents, each with an array of
1000 numbers that are 9 digits in length. In the first document, the array
is put in field1. In the second document, the array is put in
field2.
I then go to _stats and look at the document sizes.
I would have expected the document size of the one with the type:integer to
be half the size of the one with type:long as the integer is a 32 bit type,
and long is a 64 bit type. But both documents are almost exactly the same
size. And from my math (each document is around 9K, and 64 bits =
8 bytes * 1000 numbers = 8K), it seems that they are all being indexed as
64-bit longs.
I double checked the _mapping, and the fields are definitely set as long &
integer respectively. Any idea why the document indexed with field
type:integer wouldn't be far less in size than the one with type:long?
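The setup above can be approximated offline as a rough sketch. It assumes the reported document size is dominated by the JSON text of each document, which is an assumption and not something _stats is confirmed to measure. The values are randomly generated stand-ins, and `source_size` is a hypothetical helper.

```python
import json
import random

random.seed(0)  # fixed seed so the run is repeatable

# 1000 nine-digit numbers, as in the question (all fit in a signed 32-bit int).
values = [random.randint(100_000_000, 999_999_999) for _ in range(1000)]

# The same array placed in field1 (mapped as long) and field2 (mapped as integer).
# The mapping lives outside the document, so the JSON text carries no type.
doc_long = {"field1": values}
doc_int = {"field2": values}

def source_size(doc):
    """Size in bytes of the document serialized as compact UTF-8 JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

size_long = source_size(doc_long)
size_int = source_size(doc_int)
print(size_long, size_int)  # both 10012: 9000 digits + 999 commas + 13 bytes of framing
```

Under that assumption the two documents come out the same size, because each number costs 9 bytes of digits as text regardless of the declared field type.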
Note that Lucene is an inverted index; it does not behave like a bag of
documents of primitive data types. Although there are field types like
LongField, IntField, DoubleField, and FloatField for numerics, they do not
determine the overall size of the index files. To simplify, imagine a
list of pointers pointing to longs, and a list of pointers pointing to
ints. These posting list elements use the same amount of memory, no matter
what kind of fields you have in a document.
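A toy model of the idea, not Lucene's actual on-disk format: each distinct value becomes a term, and its posting list holds only document IDs. The function name and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct value (term) to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, values in docs.items():
        for value in values:
            index[value].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents: think of doc 1 as the "long" one and doc 2 as the "integer" one.
docs = {
    1: [123456789, 987654321],
    2: [123456789, 555555555],
}

index = build_inverted_index(docs)
for term, postings in sorted(index.items()):
    print(term, "->", postings)
```

The posting entries are document IDs in both cases, so in this simplified picture their size does not depend on whether the field was declared long or integer.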
Lucene doesn't know about types under the hood. We index numeric types as
prefix-coded tries to make range queries efficient. The number of bytes a
long / int value takes in the index depends on the precision_step that
is used. But that is the data in the term index. If you are curious about
the stored document size, we only know about String / UTF-8 bytes, so we
don't store this in the most efficient way, as a database would in a
dedicated column. I am afraid you cannot compare the index size to verify
that the right type is applied!
simon
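A simplified sketch of how the precision step affects the number of terms per value. It models only the idea of indexing progressively shorter prefixes, not Lucene's real byte encoding, and it handles non-negative values only. `trie_terms` is a hypothetical helper, and the step of 4 is chosen purely for illustration.

```python
def trie_terms(value, bits, precision_step):
    """Return (shift, prefix) pairs: the value with 0, step, 2*step, ... low bits dropped."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

long_terms = trie_terms(123456789, bits=64, precision_step=4)
int_terms = trie_terms(123456789, bits=32, precision_step=4)

print(len(long_terms), len(int_terms))  # 16 8
```

In this model a long produces twice as many terms as an int at the same step, and a larger step shrinks both counts, so the term index size is governed by the precision step as much as by the declared type.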
On Thursday, February 14, 2013 2:25:21 AM UTC+1, ryano wrote:
So I have a test index where I define a field1 as type: long. [...]