Paul,
Please take a look here, at the Elasticsearch documentation for the numeric_range filter:
“The numeric range filter works by loading all the relevant field
values into memory, and checking for the relevant docs if they satisfy
the range requirements.”
This implies that it uses the field cache (because only the “field cache”
appears to load all values into memory). However, it is unclear from
this documentation what the “relevant field values” are and how large
they will be once loaded. Some more light is shed here:
http://elasticsearch-users.115913.n3.nabble.com/Just-Pushed-Numeric-Range-Filter-td1715331.html
“The numeric_range filter uses the field data cache in order to do
the filtering.”
“The field data cache basically uninverts the index, and stores the
value(s) of a field indexed by doc id.”
Therefore, if I have doc1: { id:1, text:"ann", income:10, group:
["Retired", "Male"] } and doc2: { id:2, text:"ann", income:5, group:
["Teacher", "Female"] }, and group is an "untokenized" field, then from the
explanation I can come up with many interpretations of what the field cache
looks like for the "group" field and how much memory it takes. One could be
like this:
Doc1 { Retired:1, Male:1, Teacher:0, Female:0 } (size 4 bits)
Doc2 { Retired:0, Male:0, Teacher:1, Female:1 } (size 4 bits)
Extend the analogy to the numeric case with 8*10^4 unique numbers and you
have the answer to your question as to how I came up with that
interpretation of the documentation.
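To make that interpretation concrete, here is a small sketch of the memory it would imply. The function is only my own illustration of the "one bit per unique value per document" reading, not a claim about what ES/Lucene actually stores; it reproduces the ~20T figure from my original question quoted below.

# Sketch of the "one bit per (document, unique value)" interpretation described above.
# This is only my reading of the docs, not how ES/Lucene necessarily stores the data.

def bitmatrix_bytes(num_docs, num_unique_values):
    """Memory if every document kept one presence bit per unique value of the field."""
    return num_docs * num_unique_values / 8   # bits -> bytes

# The "group" example above: 2 docs x 4 unique terms = 8 bits = 1 byte.
print(bitmatrix_bytes(2, 4))                  # 1.0

# The numeric case from my original question: 2*10^9 docs x 8*10^4 unique integers.
print(bitmatrix_bytes(2e9, 8e4) / 1e12)       # ~20.0 (terabytes)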
After I made that post, I located another post which sheds some light on
what to expect from the field cache:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/200911.mbox/<012601ca5c2d$42818a70$c7849f50$@ca>
[maxdoc] x [8 bytes] per “non-tokenized” field, which for my question
would yield a requirement of 2*10^9 * 8 = 16*10^9 bytes = 16 GB.
However, the above answer is not satisfying: if I were to have a
non-tokenized “string” field, I am not sure how 8 bytes per document are
going to cope with loading the whole field value (likely greater than
8 bytes) into RAM. If you know the answer, I’d love you to share it.
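For what it's worth, the way I currently picture an uninverted, untokenized string field is something like the sketch below: the unique terms stored once, and a small fixed-width ordinal per document pointing into that table, so the per-document cost is the ordinal rather than the string itself. This data-structure model is purely my assumption, not something taken from the documentation.

# My assumed model of an uninverted, untokenized string field ("group" from the example):
# the unique terms live in one table, and each doc id maps to ordinals into that table.

unique_terms = ["Female", "Male", "Retired", "Teacher"]   # each distinct value stored once

ords_by_doc = {
    1: [2, 1],   # doc1: group = ["Retired", "Male"]
    2: [3, 0],   # doc2: group = ["Teacher", "Female"]
}

# Filtering/faceting can then work on the small per-document ordinals, resolving the
# actual strings only when needed:
print([unique_terms[o] for o in ords_by_doc[1]])   # ['Retired', 'Male']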
What I am trying to understand at this point is what the RAM
requirements are if I know the number of documents, the number of fields,
their types (string/integer, analyzed/not-analyzed), and the queries that I
will be making against these fields. Looking at the answers to similar
questions, they are mostly “it depends,” yet I feel it can be estimated
within a range if I know how much memory is needed per field type.
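Concretely, what I am hoping to end up with is a back-of-envelope estimator like the sketch below. Every constant in it (8 bytes per document for a numeric value, a 4-byte ordinal per document plus each unique term stored once with a guessed per-term overhead for an untokenized string) is my assumption, and correcting those constants is exactly the kind of answer I am looking for.

# Back-of-envelope field-cache estimator under my assumed model; every constant here is
# an assumption to be corrected, not a documented number.

def numeric_field_bytes(max_doc, bytes_per_value=8):
    """One fixed-width value per document (e.g. 8 bytes if values are held as longs)."""
    return max_doc * bytes_per_value

def string_field_bytes(max_doc, num_unique_terms, avg_term_bytes,
                       ord_bytes=4, per_term_overhead=40):
    """One ordinal per document, plus each unique term stored once (with some overhead)."""
    return max_doc * ord_bytes + num_unique_terms * (avg_term_bytes + per_term_overhead)

max_doc = 2 * 10**9
print(numeric_field_bytes(max_doc) / 1e9)                # ~16 GB for the "income" field
print(string_field_bytes(max_doc, 8 * 10**4, 10) / 1e9)  # ~8 GB, dominated by the ordinals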
On Mar 1, 6:42 pm, Paul Smith <tallpsm...@gmail.com> wrote:
> > Say as an example I have 2*10^9 documents with two fields ("text" and
> > "income"), one being a string and the other being an integer (assume
> > 80,000 "unique" integers), and I want to perform a numeric_range DSL
> > query on these docs (e.g. find docs with text="bob" and
> > income>30000). If I read the documentation correctly, for the query
> > ES will create 1 bit vector of length 8*10^4 bits or 1*10^4 bytes per
> > document (which will indicate whether the value of a unique integer is
> > present or not in the document). Thus this will require a total of
> > 2*10^9 * 1*10^4 == 2*10^13 (bytes) == 20T of memory.
> > This seems to be a pretty big memory requirement -- can you tell me
> > whether this calculation is correct. Also, would the calculation be
> > different if "range" were to be used instead of "numeric_range"?

> Where in the documentation are you reading that it does it this way?
> There's no way it's 10^4 per_document.