I ran some tests today and found something I find rather odd.
I have approximately 26,000 docs in an index, running on (a quite old,
I'm aware) version 0.19.7.
In those docs, among other things, I have two fields:
sortString: a string, not analyzed, containing 9-10 digits (ex:
"999999999")
sortDouble: a double (ex: 999999999.0005)
I understood that ES, to be able to sort, will put all those values into
the field cache.
So, if the size of such a string in memory is about 56 bytes and if the
size of a double is 8 bytes, I should use a lot less cache when sorting on
the latter.
The thing is that both sorts eat a similar amount of cache: 9.3 MB for the
strings and 9.1 MB for the doubles.
Is that normal? Is there something I did not understand about field cache use?
Any insight on that matter would be very helpful.
Elasticsearch uses Lucene, and Lucene uses an inverted index, which is
different from what you are used to in an RDBMS. The cache consists of Java
objects holding object references, and object references are almost
equal in size for all types of values in the index. You will see no big
difference if you only change the field data type.
For range searches, Lucene uses tree-like structures for integers. Dates
are stored as longs. There are also advanced techniques, like compressed
bitmaps, which affect field caching to save memory.
Jörg
On 29.03.13 14:23, DH wrote:
The thing is that both sorts eat a similar amount of cache: 9.3 MB for
the strings and 9.1 MB for the doubles.
Is that normal? Is there something I did not understand about field cache
use?
Wow, that was... quick!
Thanks a lot, Jörg, now it makes sense.
However, I seem to be getting slightly better response times when sorting on
the double, so I assume it is better, resource-wise, to sort on doubles
rather than on strings (it's easier to compare two doubles than to compare
two strings, especially when those strings often differ only in their last
few characters).
Am I right?
It depends on how many unique values you have. ES builds an array of
unique values, then an array for the docs, in which each entry points at
that doc's unique value.
So if every doc has a unique value, then you will see a big difference
in size. If you only have a few unique values, then the size will be almost
identical.
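As a rough illustration of the layout described above (a sketch only, not
Elasticsearch's or Lucene's actual code; the function name
build_field_cache is made up for this example), the two arrays could be
built like this:

```python
def build_field_cache(doc_values):
    """Return (unique_values, ordinals) for a list holding one value per doc.

    unique_values holds each distinct value exactly once, in sorted order.
    ordinals[i] is the index into unique_values for doc i (the "pointer").
    """
    unique_values = sorted(set(doc_values))
    position = {value: index for index, value in enumerate(unique_values)}
    ordinals = [position[value] for value in doc_values]
    return unique_values, ordinals


# Five docs but only two distinct values: the values array stays tiny,
# while the per-doc array has one small entry per doc whatever the type.
unique_values, ordinals = build_field_cache(["42", "7", "42", "42", "7"])
```

The per-doc array costs the same for strings and doubles, so the field type
only matters through the unique-values array.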
Thanks a lot, Clint.
I think I'm getting the hang of this.
So, I have approximately four times more unique values for the doubles
than I have for the strings; however, as doubles are so small compared to
strings, I get approximately the same cache use from ES with the two.
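A back-of-the-envelope sketch of that reasoning, assuming the two-array
layout Clint described, a 4-byte pointer per doc (an assumption), and the
56-byte and 8-byte value sizes from my first message. The unique-value
counts below are invented purely for illustration, and
estimate_cache_bytes is a hypothetical helper, so these figures are not
expected to match what ES actually reports:

```python
def estimate_cache_bytes(num_docs, num_unique, bytes_per_value, bytes_per_pointer=4):
    """Estimate memory as one pointer per doc plus one stored copy per unique value."""
    per_doc_array = num_docs * bytes_per_pointer
    unique_values_array = num_unique * bytes_per_value
    return per_doc_array + unique_values_array


NUM_DOCS = 26_000

# Hypothetical counts: four times as many unique doubles as unique strings.
string_bytes = estimate_cache_bytes(NUM_DOCS, num_unique=5_000, bytes_per_value=56)
double_bytes = estimate_cache_bytes(NUM_DOCS, num_unique=20_000, bytes_per_value=8)

print(string_bytes, double_bytes)
```

With these made-up counts, the per-doc pointer array is the same size in
both cases, and the smaller size of each double offsets the larger number
of unique doubles.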