Suppose I want to use Elasticsearch to build a system for analyzing tweets. The tweets will be fetched and tagged before being indexed into ES.
I have two options for designing the indexes for tweets.
Option 1: Create one big index for all tweets and distinguish the differently tagged tweets using the _type.
Option 2: Create a separate index for each tag.
If I choose the first option, I will end up with one fat index containing all the documents. Since the number of shards is fixed when the index is created and cannot be changed afterwards, I can calculate the number of shards from my cluster size before creating the index. As I understand it, this reduces the chance of the cluster going into the Red or Yellow state, and hence means less management.
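To make the "calculate the shards up front" idea concrete, here is a minimal sketch that builds a create-index request body. The settings `number_of_shards` and `number_of_replicas` are real Elasticsearch index settings; the sizing rule (primaries = data nodes times shards per node) and the helper names are only assumptions for illustration, not a recommendation.

```python
import json


def index_settings(data_nodes: int, shards_per_node: int = 1, replicas: int = 1) -> dict:
    """Build a create-index body with a primary shard count fixed up front.

    The rule data_nodes * shards_per_node is an illustrative assumption.
    """
    if data_nodes < 1 or shards_per_node < 1 or replicas < 0:
        raise ValueError("data_nodes and shards_per_node must be >= 1, replicas >= 0")
    return {
        "settings": {
            "number_of_shards": data_nodes * shards_per_node,
            "number_of_replicas": replicas,
        }
    }


def replicas_can_be_assigned(data_nodes: int, replicas: int) -> bool:
    """A replica is never placed on the same node as its primary, so every
    copy of a shard needs its own node; otherwise the cluster stays Yellow."""
    return replicas < data_nodes


# Body for creating the single big index on a 3-node cluster.
body = index_settings(data_nodes=3)
print(json.dumps(body, indent=2))
print("replicas assignable:", replicas_can_be_assigned(data_nodes=3, replicas=1))
```

The resulting dictionary would be sent as the body of the create-index request for the one big index.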
However, I am worried about query performance, as I think querying one very fat index will not be as efficient as querying a small one.
Another issue with this approach is that the same tweet sometimes arrives twice with different tags. If I decide to update the first tweet document, this complicates the streaming application that writes to ES, because it needs to fetch the existing tag and then update it with the new one.
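The update problem can be sketched as follows. This assumes something that is not in my current design: the tags are stored in a hypothetical array field called `tags` on a document whose ID is the tweet ID, and the ES version in use supports Painless scripting. Under those assumptions, the Update API's `script` plus `upsert` would let the streaming application send one request instead of a read followed by a write. The function names and the in-memory `store` are made up for illustration.

```python
def tag_update_body(tweet: dict, tag: str) -> dict:
    """Body for an Update API call: append the tag if the document exists,
    otherwise create the document with this tag (upsert)."""
    return {
        "script": {
            "lang": "painless",
            "source": (
                "if (!ctx._source.tags.contains(params.tag)) "
                "{ ctx._source.tags.add(params.tag) }"
            ),
            "params": {"tag": tag},
        },
        # Used only when no document with this ID exists yet.
        "upsert": {**tweet, "tags": [tag]},
    }


def apply_tag(store: dict, tweet: dict, tag: str) -> None:
    """In-memory stand-in that mirrors what the scripted upsert should do."""
    doc = store.get(tweet["id"])
    if doc is None:
        store[tweet["id"]] = {**tweet, "tags": [tag]}
    elif tag not in doc["tags"]:
        doc["tags"].append(tag)


store = {}
tweet = {"id": "42", "text": "example tweet"}
apply_tag(store, tweet, "sports")  # first arrival creates the document
apply_tag(store, tweet, "news")    # second arrival only adds the new tag
print(store["42"]["tags"])
```

Whether this fits depends on giving up _type as the tag discriminator, which is exactly the part I am unsure about.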
If I choose the second option, I will end up with many indexes (one per tag), and these indexes will be created dynamically for each new tag. There are two problems with this approach. The first is the ever-increasing number of shards, which increases the chance of the cluster going into the Red or Yellow state. The second is that we sometimes need to query two or more tags at the same time. If a tweet matches two tags, it will be indexed twice (once in each tag's index), so it will appear twice in the results and the results will not be accurate.
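To illustrate the double-counting problem, here is a rough sketch of what the client would have to do when querying several per-tag indexes at once. Elasticsearch does accept a comma-separated list of index names in the search path; the index naming scheme `tweets-<tag>`, the helper names, and the assumption that each tweet uses its tweet ID as `_id` in every index are hypothetical.

```python
from typing import Dict, List


def search_path(tags: List[str]) -> str:
    """Search path covering several per-tag indexes (hypothetical naming)."""
    return "/" + ",".join(f"tweets-{tag}" for tag in tags) + "/_search"


def merge_hits(hits_by_tag: Dict[str, List[dict]]) -> List[dict]:
    """Collapse hits that share an _id so each tweet is counted once,
    remembering which tag indexes it was found in."""
    merged: Dict[str, dict] = {}
    for tag, hits in hits_by_tag.items():
        for hit in hits:
            entry = merged.setdefault(
                hit["_id"],
                {"_id": hit["_id"], "_source": hit["_source"], "tags": []},
            )
            if tag not in entry["tags"]:
                entry["tags"].append(tag)
    return list(merged.values())


# Tweet "1" matched both tags, so it was indexed into both indexes.
hits = {
    "python": [{"_id": "1", "_source": {"text": "a"}}, {"_id": "2", "_source": {"text": "b"}}],
    "java": [{"_id": "1", "_source": {"text": "a"}}],
}
raw_count = sum(len(h) for h in hits.values())
unique = merge_hits(hits)
print(search_path(list(hits)), raw_count, len(unique))
```

Client-side merging like this fixes the hit list but not server-side counts or aggregations, which would still see the tweet twice.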
Which approach do you think will be better?