I am trying to optimize text search over a very large set of documents. As a test dataset I am using the Enron email dataset (~3 GB), but a real dataset may be 100 GB+. My URL structure is basically GET /enron/email/1.
Constraints:

- Search response times < 100 ms
- Reduce duplication of the entire dataset as much as possible
- The documents can be assumed to be immutable.
- Text may be searched in a wide range of ways. Here are some examples:
- Find all emails that contain the phrase "california power"
- Find all emails that contain "califoria", fuzzy-matched to "california"
- Find all emails that contain "easy money AND NOT hard work"
- Find all emails that contain "stealing OR blackmail"
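For concreteness, here is roughly how I imagine expressing those four searches in Elasticsearch's query DSL (just a sketch; the index and field names like `body` are placeholders, not a real mapping):

```python
# Sketch of the four example searches as Elasticsearch-style query bodies.
# The "body" field name is a placeholder, not an actual mapping.

phrase_query = {  # exact phrase: "california power"
    "query": {"match_phrase": {"body": "california power"}}
}

fuzzy_query = {  # fuzzy match: "califoria" -> "california"
    "query": {"match": {"body": {"query": "califoria", "fuzziness": "AUTO"}}}
}

and_not_query = {  # boolean: "easy money" AND NOT "hard work"
    "query": {
        "bool": {
            "must": {"match_phrase": {"body": "easy money"}},
            "must_not": {"match_phrase": {"body": "hard work"}},
        }
    }
}

or_query = {  # boolean OR: "stealing" OR "blackmail"
    "query": {
        "bool": {
            "should": [
                {"match": {"body": "stealing"}},
                {"match": {"body": "blackmail"}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```

Each of these would be sent as the JSON body of a `_search` request against the index.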
From my reading, it seems I should prefer shards over replicas if I don't want to duplicate the data. I'd appreciate any insight you can give me. Thanks!