I'm trying to load over 1 billion documents onto 10 machines with 16GB of RAM
each, with a faceted "tags" array field that has about 8000 unique values.
Looking at elasticsearch-head, each document is around 800 bytes after
compression.
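In case it's relevant, the queries that hit the field cache are plain terms
facets over the tags field, something like the following (the index name
"docs" is just a placeholder):

  curl -XGET 'localhost:9200/docs/_search?search_type=count' -d '{
    "facets": {
      "tags": { "terms": { "field": "tags", "size": 8000 } }
    }
  }'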
So far, to reduce memory, I've done the following (a rough sketch of the
resulting settings is below the list):
- switched from strings to shorts for the tags
- turned on source compression
- switched from 60 shards to 20 shards (maybe I need to go to 10?)
- set "index.cache.field.type: soft", although I'm not sure exactly what that
does
Suggestions on what to do next?
- For the obvious hardware change: if the choice were 20 machines with 16GB
each or 10 machines with 32GB each, is there a clear winner?
- Would it help if I split the single short field with 8000 values into 32
one-byte fields?
- Relatedly, we could move some of the tags out into separate single-value
fields; would that help? (For example, we currently have a tag for every
country instead of a single country field.) A rough sketch of both of these
mapping ideas is after the list.
- Wait for 0.20?
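To make the two mapping ideas above concrete, here's roughly what I have in
mind (field names like "tags_0" and "country" are made up for illustration,
and only the first couple of the 32 byte fields are shown):

  # Option A: split the single "tags" short field into 32 one-byte fields
  # (tags_2 .. tags_31 would follow the same pattern)
  curl -XPUT 'localhost:9200/docs/doc/_mapping' -d '{
    "doc": {
      "properties": {
        "tags_0": { "type": "byte" },
        "tags_1": { "type": "byte" }
      }
    }
  }'

  # Option B: pull common tag groups out into dedicated single-value fields
  curl -XPUT 'localhost:9200/docs/doc/_mapping' -d '{
    "doc": {
      "properties": {
        "tags":    { "type": "short" },
        "country": { "type": "string", "index": "not_analyzed" }
      }
    }
  }'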