Numeric_range and memory requirements


(andym) #1

Hello,

I am trying to do a back-of-envelope calculation of memory
requirements to be certain we can perform some of our queries once our
data set is fully populated, and I want to make sure my calculations
are approximately correct.

Say as an example I have 2*10^9 documents with two fields ("text" and
"income"), one being a string and the other being an integer (assume
80,000 "unique" integers), and I want to perform a numeric_range DSL
query on these docs (e.g. find docs with text="bob" and
income>30000). If I read the documentation correctly, for the query
ES will create 1 bit vector of length 8*10^4 bits, or 1*10^4 bytes, per
document (which will indicate whether each unique integer value is
present or not in the document). Thus this will require a total of
2*10^9 * 1*10^4 == 2*10^13 bytes == 20 TB of memory.
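For the record, the arithmetic under that bit-vector interpretation works out like this (the figures are the ones from the post; the per-document bit-vector layout is the interpretation being questioned, not confirmed ES behavior):

```python
# Back-of-envelope check of the bit-vector interpretation: one presence
# bit per unique integer value, per document. These are the numbers from
# the post, not measured Elasticsearch behavior.

num_docs = 2 * 10**9        # documents in the index
unique_ints = 8 * 10**4     # distinct "income" values

bits_per_doc = unique_ints          # one bit per unique value
bytes_per_doc = bits_per_doc // 8   # 8*10^4 bits -> 1*10^4 bytes
total_bytes = num_docs * bytes_per_doc

print(bytes_per_doc)                 # 10000
print(total_bytes)                   # 20000000000000, i.e. 2*10^13
print(total_bytes / 10**12, "TB")    # 20.0 TB
```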

This seems to be a pretty big memory requirement -- can you tell me
whether this calculation is correct? Also, would the calculation be
different if "range" were used instead of "numeric_range"?

Thanks!

-- Andy


(Paul Smith) #2

Say as an example I have 2*10^9 documents with two fields ("text" and
"income"), one being a string and the other being an integer (assume
80,000 "unique" integers), and I want to perform a numeric_range DSL
query on these docs (e.g. find docs with text="bob" and
income>30000). If I read the documentation correctly, for the query
ES will create 1 bit vector of length 8*10^4 bits, or 1*10^4 bytes, per
document (which will indicate whether each unique integer value is
present or not in the document). Thus this will require a total of
2*10^9 * 1*10^4 == 2*10^13 bytes == 20 TB of memory.

This seems to be a pretty big memory requirement -- can you tell me
whether this calculation is correct? Also, would the calculation be
different if "range" were used instead of "numeric_range"?

where in the documentation are you reading that it does it this way?
there's no way it's 10^4 bytes per document.


(andym) #3

Paul,
Please take a look right here: http://www.elasticsearch.org/guide/reference/query-dsl/numeric-range-filter.html
“The numeric range filter works by loading all the relevant field
values into memory, and checking for the relevant docs if they satisfy
the range requirements.”

This implies that it uses the field cache (because only the "field cache"
appears to load all values into memory). However, it is unclear from
this documentation what the "relevant field values" are and what their
sizes will be once loaded. Some more light is shed here:

http://elasticsearch-users.115913.n3.nabble.com/Just-Pushed-Numeric-Range-Filter-td1715331.html
“The numeric_range filter uses the the field data cache in order to do
the filtering.”
“The field data cache basically uninverts the index, and stores the
value(s) of a field indexed by doc id.”

Therefore if I have doc1: { id:1, text:"ann", income:10, group:
["Retired", "Male"] } and doc2: { id:2, text:"ann", income:5, group:
["Teacher", "Female"] }, and group is an "untokenized" field, from the
explanation I can have many interpretations of what the field cache
looks like for the "group" field and how much memory it takes. One
could be like this:

Doc1 { Retired:1, Male:1, Teacher:0, Female:0 } (size 4 bits)

Doc2 { Retired:0, Male:0, Teacher:1, Female:1 } (size 4 bits)

Extend the analogy to the numeric case with 8*10^4 unique numbers and
you have the answer to your question as to how I came up with that
interpretation of the documentation.
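The "uninverting" idea from the quoted post can be sketched in a few lines (a toy model of the two example docs above, not Lucene's actual data structures):

```python
# Toy uninversion of the two example docs: the inverted index maps
# term -> doc ids; the field data cache flips that into doc id -> values.
# Illustrative only -- Lucene's real field cache layout is more compact.

inverted = {                 # what the index stores for the "group" field
    "Retired": [1], "Male": [1],
    "Teacher": [2], "Female": [2],
}

field_cache = {}             # doc id -> list of values ("uninverted")
for term, doc_ids in inverted.items():
    for doc_id in doc_ids:
        field_cache.setdefault(doc_id, []).append(term)

print(field_cache)   # {1: ['Retired', 'Male'], 2: ['Teacher', 'Female']}
```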

After I made the post I located another post which sheds light on what
to expect from the field cache:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/200911.mbox/<012601ca5c2d$42818a70$c7849f50$@ca>

[maxdoc] x [8 bytes] per "non-tokenized" field, which for my question
would yield a requirement of 2*10^9 * 8 = 16G
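Spelled out, that rule of thumb from the linked Lucene/Solr thread gives:

```python
# The linked thread's rule of thumb: field cache costs roughly
# [maxDoc] * [8 bytes] per non-tokenized field. Applied to the numbers
# in this thread (2*10^9 docs); this is the rule of thumb, not a
# measurement.

max_doc = 2 * 10**9
bytes_per_entry = 8    # one long-sized slot per document

field_cache_bytes = max_doc * bytes_per_entry
print(field_cache_bytes / 10**9, "GB")   # 16.0 GB -- the "16G" figure
```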

However the above answer is not satisfying: if I were to have a
non-tokenized "string" field, I am not sure how 8 bytes are going to
cope with loading the whole field value (likely greater than 8 bytes)
into RAM. If you know the answer, I'd love for you to share it.

What I am trying to understand at this point is what the RAM
requirements are if I know the number of documents, the number of
fields, their types (string/integer, analyzed/not-analyzed), and the
queries that I will be making against these fields. Looking at the
answers to similar questions, they are mostly "it depends," yet I feel
it can be estimated within a range if I know how much memory is needed
per field type.

On Mar 1, 6:42 pm, Paul Smith tallpsm...@gmail.com wrote:

where in the documentation are you reading that it does it this way?
there's no way it's 10^4 bytes per document.


(Paul Smith) #4

Extend the analogy to the numeric case with 8*10^4 unique numbers and
you have the answer to your question as to how I came up with the
interpretation of the documentation.

For ints, as I understand it though, it doesn't store the unique string
value of the number, but instead actually stores different precisions
of the number in multiple fields. The best way to explain this is
probably to Google "TrieRangeQuery", which was introduced to Lucene by
Uwe Schindler way back; there's a good explanation out there of how,
for a given int field "x", Lucene will store "x_precision1",
"x_precision2" or whatever with different granularities of the number.
This way the number of unique terms is kept very, very low. Bitmasking
is done during numeric range queries to do special tricks: find all the
low-precision points covering the bulk of the range, then mask out the
end bits you don't need (this is a gross simplification and may not
actually be quite that way, but hopefully it highlights how the
different precision values stored end up giving fast range queries with
a relatively small # of unique terms).
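Something like this toy sketch, perhaps (this is the general trie idea, not Lucene's actual encoding; the 16-bit precision step and the helper name are illustrative choices of mine):

```python
# Toy sketch of the trie idea: each integer is indexed several times,
# with its low bits progressively shifted away, so a range query can
# match a few coarse prefix terms instead of every individual value.
# precision_step=16 is an illustrative choice, not Lucene's default.

def trie_terms(value, bits=32, precision_step=16):
    """Return (shift, prefix) terms indexed for one integer value."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

# income = 30000 is indexed as a full-precision term plus a coarse prefix:
print(trie_terms(30000))   # [(0, 30000), (16, 0)]

# 80,000 distinct values collapse into very few coarse prefixes:
values = range(80_000)
coarse = {prefix for v in values
          for (shift, prefix) in trie_terms(v) if shift > 0}
print(len(coarse))         # 2 -- only two distinct 16-bit prefixes for 0..79999
```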

So, while I can't quite put my finger on it exactly, I'm fairly
confident the # of unique terms required will be vastly lower than the
20 TB estimate; the 16 GB figure seems more reasonable. Shay or others
will hopefully chime in here with some more specifics.

What I am trying to understand at this point is what are the RAM
requirements if I know number of documents, number of fields, their
types (string/integer, analyzed/not-analyzed) and queries that I will
be making against these fields. Looking at the answers to similar
questions, they are mostly "it depends," yet I feel it can be
estimated within a range if I know how much memory is needed per field
type.

It wouldn't be hard to create a quick, simple index of 200 million items
with just the ids, do a numeric range query, and profile the memory
usage with jmap -histo or something to see the allocation. While the
numeric range stuff does use more memory than standard range queries, I
don't think it's anywhere near 20 TB. Range queries simply need to scan
to the first term (and do this efficiently, because the term readers
only keep every Xth term position in memory -- not all terms are loaded
-- so they can jump there very quickly), then add all the bits from the
term masks up to the end range term. So there's not much RAM used for
range queries at all; the term readers already have the terms they need
loaded into memory for efficiency anyway.
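The "every Xth term" trick can be sketched like so (a toy model of a term index; the names, the interval of 128, and the term layout are all illustrative, not Lucene internals):

```python
# Toy sketch of a sparse term index: only every Nth term is held in
# memory; a range scan binary-searches that sparse index, then reads
# forward from the matching block "on disk".

import bisect

all_terms = [f"{n:08d}" for n in range(0, 100_000, 7)]  # sorted terms "on disk"
INTERVAL = 128
index_terms = all_terms[::INTERVAL]                     # in-memory subset

def seek(term):
    """Find the on-disk offset to start scanning from."""
    i = bisect.bisect_right(index_terms, term) - 1
    return max(i, 0) * INTERVAL                         # offset into all_terms

start = seek("00030000")
# scan forward from `start` instead of from the beginning of the index
assert all_terms[start] <= "00030000" <= all_terms[start + INTERVAL]
print(start, all_terms[start])   # 4224 00029568
```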

Paul


(Shay Banon) #5

The numeric range query loads all the unique values of the field into memory, so if you have 2*10^9 docs, each with an income, then the unique values of that income field are loaded into memory. There is an additional "ordinals" array that is loaded to map from doc id to the relevant value. But that's the case for the numeric_range filter, which I am not sure you really need. Why not use the regular range filter? Especially with income, and with filter caching, it can probably cache the most common checks you do pretty nicely (those usually, I assume, though I might be wrong, use the same values).
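Roughly costing out the layout Shay describes (the 4-byte sizes for values and ordinals are my assumptions for a rough estimate, not measured Elasticsearch numbers):

```python
# Rough cost of the field-data layout described above: one array of
# unique values, plus an "ordinals" array mapping each doc id to its
# value. Entry sizes are assumptions (4-byte ints, 4-byte ordinals).

num_docs = 2 * 10**9
unique_values = 80_000

values_bytes = unique_values * 4    # unique ints, 4 bytes each
ordinals_bytes = num_docs * 4       # one ordinal per document

total = values_bytes + ordinals_bytes
print(total / 10**9, "GB")          # ~8 GB -- far below the 20 TB estimate
```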

On Friday, March 2, 2012 at 4:06 AM, Paul Smith wrote:

