Now we have about 5 millions of groups and 100 millions of users. There are about 120 million messages every day users sent to the group, approximate 12GiB. We'd like to use Elasticsearch to search group messages with text.
We're testing with a time-based index (index by day, week, and month). We are also using 5 shards/index with routing key is groupID.
The text analyzer is edge-nGram with min-gram is 2 and max-gram is 20.
The peak indexing rate is 4000 req/s. However, the searching is too slow. We've expected that we can process 3000 req/s for searching. Because of the big number of groups, Elasticsearch spent a lot of time to query.
We've tried to compare 5 shards/index and 11 shards/index for both daily-index and monthly-index, but searching on 11 shards/index is slower than 5 shards/index. Anyway, both sharding strategies are unsatisfied with our case.
My questions are:
- Which time-based index we should use in our case? By day, week, or month?
- How many shards we should use for the selected index?
I've read a blog from Discord talking about how Discord indexes billions of messages for group search. I know Discord is using their shard allocator which do mapping from groupId to (cluster, index) of Elasticsearch. Their mapping is using a database with caching. They don't use sharding from Elasticsearch because of their application-level sharding.
I concern that we should follow as Discord architecture or believe Elasticsearch Sharding.