How to scale text search with very large indices

I am trying to optimize text search over a very large set of documents. As a test dataset I am using the Enron email dataset (~3 GB), but a real dataset may be 100 GB+. My structure is basically one email per document, retrieved with GET /enron/email/1.
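
For reference, a document in my test index looks roughly like this (the field names are just an illustration of my layout, not a fixed schema):

```
PUT /enron/email/1
{
  "from": "sender@enron.com",
  "to": "recipient@enron.com",
  "subject": "Q3 forecast",
  "body": "Full text of the email goes here..."
}
```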

Constraints:

  • Search response times < 100 ms

  • Reduce duplication of the entire dataset as much as possible

  • The documents can be assumed to be immutable.

  • Text may be searched in a wide range of ways. Here are some examples (sketched as concrete queries after this list):

    • Find all emails that mention the phrase "california power"
    • Find all emails that contain "califoria", fuzzy-matching it to "california"
    • Find all emails that contain "easy money AND NOT hard work"
    • Find all emails that contain "stealing OR blackmail"

From my reading, it seems I should prefer shards over replicas if I don't want to duplicate the data. I'd appreciate any insight you can give me. Thanks!
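
In other words, I am currently leaning toward index settings along these lines (the shard count here is only a placeholder; zero replicas avoids duplicating the data, at the cost of any failover):

```
# Multiple primary shards to parallelize search across nodes;
# zero replicas so each document is stored only once
PUT /enron
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0
  }
}
```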

I would watch this video from elastic{ON}'16: https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
