Inconsistent facet memory usage

Executing a statistical facet on a long field on a clean index with 10K
items uses 120KB in the field cache, e.g. this gist: https://gist.github.com/3015553.
That's 12 bytes per long, which seems great.

I have another index with 3.5K items. Doing a statistical facet on a long
field in it uses 1.5MB in the field cache. That's over 400 bytes per long,
which seems bad.

They should both be single-valued fields--I don't get why memory usage is
going so crazy. One possible difference is that the other index has had lots
of inserts, updates, and deletes, in case that affects things. I've run an
optimize on the index with no change in memory usage. What factors could
influence memory usage? Both indexes are on the same ES node with stock
settings for the number of shards.

I took some heap dumps, and it looks like the difference is that the
bad-scenario index has many, many docs in it (not all of the type being
searched), so the ordinals array is huge.

Good scenario: the SingleValueLongFieldData instances have a valuesCache
array about the same size as the ordinals array.
Bad scenario: the SingleValueLongFieldData instances have a valuesCache
array of size 21 and an ordinals array of size 45651.
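To make those numbers concrete, here's my back-of-the-envelope reading of what the two arrays cost. This is just a rough model based on the array sizes in the heap dumps, not the actual SingleValueLongFieldData internals:

```python
# Rough cost model, based only on the array sizes seen in the heap dumps:
# one long slot per distinct value (the valuesCache side) plus one int slot
# per Lucene doc in the segment, regardless of type (the ordinals side).
LONG_BYTES = 8
INT_BYTES = 4

def approx_field_data_bytes(distinct_values, max_doc):
    return distinct_values * LONG_BYTES + max_doc * INT_BYTES

# Good scenario: ordinals roughly the same size as the values.
print(approx_field_data_bytes(10000, 10000))  # ~120KB, i.e. ~12 bytes/long

# Bad scenario: only 21 distinct values but 45,651 docs, so the ordinals
# array dominates; each copy of it costs ~180KB on its own, and that
# multiplies across every FieldData that loads its own copy.
print(approx_field_data_bytes(21, 45651))
```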

I posted a revised gist, https://gist.github.com/3018581, that sets up this
scenario, and indeed, doing the same operation there is orders of magnitude
more memory intensive in the field cache.
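The shape of that scenario, again as a placeholder sketch rather than the gist itself (it reuses the req() helper from the earlier sketch):

```python
# Sketch of the bad scenario: the same index also holds tens of thousands
# of docs of another type that never carry the long field. Placeholder
# names and counts; reuses req() from the sketch above.
for i in range(45000):
    req("PUT", "/badindex/other/%d" % i, {"text": "filler doc %d" % i})

for i in range(3500):
    req("PUT", "/badindex/item/%d" % i, {"value": i * 1000})
req("POST", "/badindex/_refresh")

# The facet only touches the 3.5K "item" docs...
req("POST", "/badindex/item/_search", {
    "query": {"match_all": {}},
    "facets": {"value_stats": {"statistical": {"field": "value"}}},
})
# ...but the ordinals array it loads is still sized by the total doc count.
```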

Is there some way to mitigate this? Can the ordinals array be shared across
multiple FieldDatas? What really screws us is that each facet seems to build
its own ordinals array and retain it in memory, so we're paying 1.5MB for
every 200KB of actual field data.

Oh, duh -- the ordinals can't be shared across the field datas, but maybe a
mapping from doc ID to an index into a more compact array (sized to the
number of docs of that type, rather than the total number of docs) could be
shared?
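Purely as an illustration of the idea (not ES code), something like this layout is what I mean:

```python
# Illustrative only, not ES code: one shared doc-id -> slot mapping sized by
# the total doc count, plus per-field value arrays sized only by the number
# of docs of the faceted type.
class SharedDocMapping:
    def __init__(self, max_doc, docs_of_type):
        # Built once and shared: every Lucene doc id maps to a slot in the
        # compact per-field arrays, or -1 if the doc isn't of that type.
        self.doc_to_slot = [-1] * max_doc
        self.docs_of_type = docs_of_type

class CompactLongFieldData:
    def __init__(self, mapping):
        self.mapping = mapping
        # Per-field storage is proportional to the docs of the type,
        # not to the whole index.
        self.values = [0] * mapping.docs_of_type

    def value(self, doc_id):
        slot = self.mapping.doc_to_slot[doc_id]
        return self.values[slot] if slot != -1 else None

# e.g. 45,651 docs total but only 3,500 of the faceted type: the
# 4 bytes * max_doc mapping is paid once, and each extra field/facet
# only adds 8 bytes * docs_of_type.
mapping = SharedDocMapping(max_doc=45651, docs_of_type=3500)
field_a = CompactLongFieldData(mapping)
field_b = CompactLongFieldData(mapping)  # shares the same mapping
```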
