More facet memory reduction questions

I'm trying to get over 1 billion documents loaded onto 10 16G machines, with a
faceted "tags" array field that has about 8000 unique values. Looking at
elasticsearch-head, each document is around 800 bytes after compression.

So far, to reduce memory, I've:

  • switched from strings to shorts for the tags (the index-creation side of
    these changes is sketched after this list)
  • turned on source compression
  • switched from 60 shards to 20 shards (maybe I need to go to 10?)
  • set "index.cache.field.type: soft", although I'm not sure whether that
    actually does anything
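
For reference, the index-creation side of the first three changes looks roughly
like this (the type name "doc" is just a placeholder; ta is the tags field):

    {
      "settings": {
        "index": { "number_of_shards": 20 }
      },
      "mappings": {
        "doc": {
          "_source": { "compress": true },
          "properties": {
            "ta": { "type": "short" }
          }
        }
      }
    }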

Suggestions on what to do next?

  • For the obvious hardware change: if the choice were 20 16G machines or 10
    32G machines, is there a clear winner?
  • Would it help if I split the single short field with 8000 values into 32
    one-byte fields?
  • Related: we could move some of the tags out into separate single-item
    fields; would that help? (For example, we have tags for every country
    instead of just a country field; see the sketch after this list.)
  • Wait for 0.20? :)
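
To make the third bullet concrete, a document would go from something like the
first form below to the second, with the country pulled out of the tags array
into its own single-valued field (the tag ids are made up):

    { "ta": [12, 47, 881, 2204] }

    { "ta": [12, 47, 2204], "country": 881 }

Same information, just one fewer value in the big multi-valued ta array for
each doc.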

Thanks,
Andy

Also, would splitting into multiple indices and using an index alias to search
help?
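
What I'm picturing is several smaller indices with one alias over them, created
with something like the following posted to _aliases (the index and alias names
are made up):

    {
      "actions": [
        { "add": { "index": "docs-1", "alias": "docs" } },
        { "add": { "index": "docs-2", "alias": "docs" } }
      ]
    }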

Splitting into multiple indices will not really change anything; effectively, a
shard is what's important, and it does not matter whether you have 1 index with
20 shards or 10 indices with 1 shard each.

What is the memory usage that you see now? What do the field cache stats
report? Are you running into memory problems?

Using ES_HEAP_SIZE=13G on the 10x16G machines, I'm able to get to about 600M
documents before I start getting OOM errors: "loading field [ta] caused out
of memory failure".

Here is the largest field cache size I see:

    "cache" : {
      "field_evictions" : 582,
      "field_size" : "12.9gb",
      "field_size_in_bytes" : 13868914306,
      "filter_count" : 3,
      "filter_evictions" : 0,
      "filter_size" : "7.6mb",
      "filter_size_in_bytes" : 8006872
    },

Here is my sample query, which uses a range filter to limit the response to
around 1 million documents. Should I use a range query instead of a filter?

    {
      "fields": ["", "", "fp", "lp", "a1", "a2", "p1", "p2", "pa", "by", "no", "us", "ro"],
      "from": "0",
      "size": 100,
      "sort": { "lp": { "order": "desc" } },
      "facets": {
        "ta": { "terms": { "field": "ta", "size": 1000 } }
      },
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "and": [
              { "numeric_range": { "lp": { "from": 1336138259 } } }
            ]
          }
        }
      }
    }

Changing the way you query the data will not make a difference. Are you
faceting on any fields other than ta? Is ta the tags field? What is the maximum
number of values the tags field can have in a single doc?

Yes, the ta field is the tags field, and it's the only field I facet on. There
are about 8000 possible unique values in the tags array; I've never looked at
what the min/average/max size of the tags array per doc is. (Is there a query
to find out?)
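
Maybe a statistical facet over a script field that counts the values per doc
would report it; I haven't tried this, and the exact script accessor may need
tweaking:

    {
      "size": 0,
      "query": { "match_all": {} },
      "facets": {
        "ta_per_doc": {
          "statistical": { "script": "doc['ta'].values.size()" }
        }
      }
    }

The min/max/mean in the facet response would then be the per-doc tag counts.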

Guessing: min is 2, max is 300, average about 30.

If needed I can slice up the tags field into multiple fields.

Thanks,
Andy

Would the number of shards per node matter? You [Andy] mentioned the
number of shards, but not the number of replicas. If each node held only
half of the data, instead of full replicas everywhere, wouldn't the memory
needed for the field cache be reduced on each node? I have never tested
this hypothesis, but was about to try it myself.

--
Ivan

I should have mentioned two other things I did in the hope of reducing memory:
I turned off replicas for this index and disabled the _all field.
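
Concretely, the replicas change was an update-settings call with a body of

    { "index": { "number_of_replicas": 0 } }

and the _all field is disabled in the mapping (the type name "doc" is just a
placeholder):

    {
      "doc": {
        "_all": { "enabled": false }
      }
    }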

Thanks,
Andy

Fields that have multiple values per document can contribute greatly to the
memory used. There is not really anything to be done about it in terms of
improving it, other than scaling out or up to increase the available memory
(and having enough shards to span the nodes).

Hi Shay

I've seen this discussed a few times, and I understand that the issue with
multi-values is that you construct an array with max_number_of_values slots
for each doc. So if you have one doc with 20 values in a field, then the
lookup table would have num_docs * 20 slots.

I assume this is done for efficient lookup; the downside, of course, is the
massive memory use.

Would it not be possible to have an alternate lookup table which is more
memory efficient, at the cost of slower lookups?

clint

So if I reduce the max number of elements in the multi-value field, would that
help?

Assuming I'm not CPU bound (because of the low query rate) and only memory
bound, is it better to add more memory to the existing machines or to add more
machines? Example: should I go from 10x16G to 20x16G machines or to 10x32G
machines? I assume the 10x32G because of overhead?

I might be getting 10x64G machines. Should I run 2 nodes per machine (maybe
20G heap each) so that I don't hit Java GC issues?

Thanks,
Andy

Because most of the memory used is bound to the number of documents (because of
the multi values and the current way things are represented), you can either
increase the memory on each machine or add more machines (and enough shards to
span them).
