I am trying to optimize text search over a very large set of documents. As a test dataset I am using the Enron email dataset (~3 GB), but a real dataset may be 100 GB+. My URL structure is basically GET /enron/email/1.
Constraints:

- Search response times < 100 ms
- Reduce duplication of the entire dataset as much as possible
- The documents can be assumed to be immutable.
- Text may be searched in a wide range of ways. Here are some examples:
- Find all emails that contain the phrase "california power"
- Find all emails that contain "califoria", fuzzy-matched to "california"
- Find all emails that contain "easy money AND NOT hard work"
- Find all emails that contain "stealing OR blackmail"
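For concreteness, here is roughly how I imagine expressing those four searches in Elasticsearch's query DSL (just a sketch; the index and field names like `body` are placeholders, not a real mapping):

```python
# Sketch of the four example searches as Elasticsearch-style query bodies.
# The "body" field name is a placeholder, not an actual mapping.

phrase_query = {  # exact phrase: "california power"
    "query": {"match_phrase": {"body": "california power"}}
}

fuzzy_query = {  # fuzzy match: "califoria" -> "california"
    "query": {"match": {"body": {"query": "califoria", "fuzziness": "AUTO"}}}
}

and_not_query = {  # boolean: "easy money" AND NOT "hard work"
    "query": {
        "bool": {
            "must": {"match_phrase": {"body": "easy money"}},
            "must_not": {"match_phrase": {"body": "hard work"}},
        }
    }
}

or_query = {  # boolean OR: "stealing" OR "blackmail"
    "query": {
        "bool": {
            "should": [
                {"match": {"body": "stealing"}},
                {"match": {"body": "blackmail"}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```

Each of these would be sent as the JSON body of a `_search` request against the index.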
From my reading, it seems I should prefer shards over replicas if I don't want to duplicate the data. I'd appreciate any insight you can give me. Thanks!