Term facet memory consideration in the documentation

Hello List,

I am not sure I understand the 'Memory consideration' paragraph at:

http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html

in particular the phrase "Term facet causes the relevant field values to
be loaded into memory. " (maybe because of my bad command of English :wink: )

If I have 100 million documents, each with a field 'field1' of type
'byte', and only 30 different values (or terms) are ever used across
all the documents, will a terms facet search on 'field1' load into RAM

a) 100 million times one byte, (relevant=all the values)
b) or just 30 times one byte, (relevant=all the distinct values)
c) or just N<=30 times one byte, where N is the number of distinct
values among the documents matching the facet filter? (relevant=all the
distinct values that can possibly be returned given the filters)
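
To put numbers on the three options, here is a rough sketch of the raw arithmetic only. It ignores whatever per-entry overhead the actual in-memory structures add, and says nothing about which option Elasticsearch really implements:

```python
# Back-of-the-envelope sizes for the three readings of "relevant field values".
# Pure arithmetic: it ignores any overhead of the real data structures.
NUM_DOCS = 100_000_000      # documents in the index (from the question)
DISTINCT_VALUES = 30        # distinct terms in 'field1' (from the question)
BYTES_PER_VALUE = 1         # a 'byte' field

size_a = NUM_DOCS * BYTES_PER_VALUE          # a) one value per document
size_b = DISTINCT_VALUES * BYTES_PER_VALUE   # b) one value per distinct term
# c) is bounded by b): at most 30 distinct values can match any filter
size_c_max = size_b

print(f"a) {size_a} bytes (about {size_a / 2**20:.1f} MiB)")
print(f"b) {size_b} bytes")
print(f"c) at most {size_c_max} bytes")
```

So the difference between a) and b)/c) is roughly 100 MB versus a few dozen bytes, which is why the wording matters to me.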

Also, is there a way to get approximate counts per term, using less
memory and/or still usable when the count per term is really large?
BTW is 1<<32=4294967296 the maximum count one can get, or does the count
use a float?
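
The reason I ask about integer versus float is the loss of precision for large counts. A small sketch of what I mean (generic arithmetic only, I do not know which type Elasticsearch uses for the counts; the helper `as_float32` is just for illustration):

```python
import struct

# 1 << 32 is 2**32, as written above.
assert 1 << 32 == 4294967296

# Largest values of common signed integer counter widths.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

def as_float32(n: int) -> float:
    """Round-trip n through a 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", float(n)))[0]

# A float32 has a 24-bit significand, so above 2**24 not every
# integer can be represented and a count would silently be rounded.
exact = as_float32(2**24)        # still exact
rounded = as_float32(2**24 + 1)  # the +1 is lost

print(INT32_MAX, INT64_MAX)
print(exact, rounded)
```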

Thanks
TuXRaceR

--

IMHO, it's c)

BTW for your last question, look at: https://github.com/ptdavteam/elasticsearch-approx-plugin

HTH

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 10 Jan 2013 at 23:58, TuX RaceR tuxracer69@gmail.com wrote:

--

Thank you David,

Actually, after reading

http://elasticsearch-users.115913.n3.nabble.com/terms-facet-explodes-memory-td3258748.html

I would rule out c).

Thank you for the very interesting link.

I am not sure whether this plugin can be combined with a facet 'filter',
i.e. not only a count by date, but a count by date restricted to
documents matching a condition (e.g. documents belonging to a user).
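
With the built-in facets, this kind of restriction is written as a `facet_filter` next to the facet definition. Below is a sketch of the request body I have in mind, built in Python only to print the JSON. The facet name `docs_per_day`, the field names `date` and `user`, and the value `some_user` are placeholders I made up. Whether the plugin's own facet type honours `facet_filter` in the same way is exactly what I am unsure about.

```python
import json

# Hypothetical names: 'docs_per_day', 'date', 'user' and 'some_user'
# are placeholders for illustration, not taken from a real index.
request_body = {
    "query": {"match_all": {}},
    "facets": {
        "docs_per_day": {
            # built-in facet: count of documents per day
            "date_histogram": {"field": "date", "interval": "day"},
            # only documents matching this filter are counted by the facet
            "facet_filter": {"term": {"user": "some_user"}},
        }
    },
}

print(json.dumps(request_body, indent=2))
```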

Thanks
TuXRaceR

On 01/11/2013 02:23 AM, David Pilato wrote:
