Suppose I want to use Elasticsearch to build a system for analyzing tweets. The tweets will be fetched and tagged before being indexed into ES.
I have two options for designing the indexes for tweets.
Option 1: Create one big index for all tweets and distinguish between differently tagged tweets using the _type.
Option 2: Create a separate index for each tag.
If I choose the first option, I will end up with one fat index containing all the documents. Because the number of shards is fixed when the index is created and cannot be changed afterwards, I can calculate the number of shards according to my cluster size before creating the index. As I understand it, this reduces the chance of the cluster going into a Red or Yellow state and therefore means less management.
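To make the point about shards concrete, here is a minimal sketch of the request body I would send when creating the single index. The shard and replica counts are placeholder values, not a recommendation, and `build_index_body` is just a hypothetical helper name.

```python
import json

def build_index_body(number_of_shards, number_of_replicas=1):
    """Build the body for a create-index request (PUT /<index-name>).

    number_of_shards is fixed at creation time; number_of_replicas can
    still be changed later through the index settings API.
    """
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas,
        }
    }

# Placeholder values for illustration only.
body = build_index_body(number_of_shards=5, number_of_replicas=1)
print(json.dumps(body, indent=2))
```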
But I am worried about query performance, as I think querying one very big index will not be as efficient as querying a small one.
Another issue with this approach is that the same tweet sometimes arrives twice with different tags. If I decide to update the first tweet document, this will complicate the streaming application that writes to ES, as it needs to get the first tag and then update the tag with the new one.
If I choose the second option, I will end up with many indexes (one for each tag), and these indexes will be created dynamically for each new tag. There are two problems with this approach. The first is the ever-increasing number of shards, which increases the chance of the cluster going into a Red or Yellow state. The other is that we sometimes need to query two or more tags at the same time. If a tweet matches two tags, it will be indexed twice (once in each index) and the results will not be accurate.
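To illustrate the duplicate problem, here is a minimal sketch of how the client would have to merge the hits after searching several per-tag indexes at once. It assumes the same tweet is stored under the same `_id` in every index and that each index is named after its tag. Both are my assumptions, and `merge_hits` is a hypothetical helper.

```python
def merge_hits(hits):
    """Collapse hits from several per-tag indexes into one entry per tweet.

    `hits` is a list of dicts shaped like the entries of hits.hits in a
    search response: each has "_index", "_id" and "_source".
    """
    merged = {}
    for hit in hits:
        entry = merged.setdefault(
            hit["_id"], {"tweet": hit["_source"], "tags": []}
        )
        # Assumption: the index name doubles as the tag.
        if hit["_index"] not in entry["tags"]:
            entry["tags"].append(hit["_index"])
    return merged

# Hand-written sample hits, not real search output.
sample_hits = [
    {"_index": "tag-a", "_id": "1", "_source": {"text": "first tweet"}},
    {"_index": "tag-b", "_id": "1", "_source": {"text": "first tweet"}},
    {"_index": "tag-b", "_id": "2", "_source": {"text": "second tweet"}},
]
unique = merge_hits(sample_hits)
print(len(sample_hits), "hits,", len(unique), "unique tweets")
```

This only repairs the hit list on the client side. Counts and aggregations computed by ES itself would still see such a tweet once per index.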
The number of tags cannot be predicted; this part is completely dynamic. Today we may get tweets about a visit by the president, and tomorrow tweets about North Korea, and so on.
Note that some tweets could match two tags.
What should I do in this scenario? Should I make the tag field a list and, if the same tweet comes again in the stream, update the old tweet by adding the new tag to the list? Or should I insert a new tweet document with a different ID and a different tag field?
The first option will save disk space and be better for performance (fewer documents to query), but it will complicate the stream, because for each incoming tweet I will have to check whether the same tweet already exists in ES. Considering that millions of tweets are fetched and indexed every day, I think this will not be good.
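One way the existence check could be pushed to ES instead of the streaming application is a scripted upsert sent through the bulk API, using the tweet ID as the document `_id`. The following is a minimal sketch under assumptions: an ES version that supports Painless scripts with the `source` key, a list field called `tags`, and an index called `tweets`. The helper names and field names are hypothetical, versions that still use `_type` would also need it in the action line, and I have not measured how this behaves at millions of tweets per day.

```python
import json

# Painless script: append the tag only if the stored list lacks it.
ADD_TAG_SCRIPT = (
    "if (!ctx._source.tags.contains(params.tag)) "
    "{ ctx._source.tags.add(params.tag) }"
)

def build_tag_upsert(tweet_id, tweet, tag, index="tweets"):
    """Return the two bulk lines (action, payload) for one tagged tweet.

    If no document with this _id exists, the "upsert" document is indexed
    as-is; otherwise the script runs against the stored document.
    """
    action = {"update": {"_index": index, "_id": tweet_id}}
    payload = {
        "script": {
            "lang": "painless",
            "source": ADD_TAG_SCRIPT,
            "params": {"tag": tag},
        },
        # Copy the tweet and attach the first tag without mutating the input.
        "upsert": dict(tweet, tags=[tag]),
    }
    return action, payload

def to_bulk_body(pairs):
    """Serialize (action, payload) pairs as newline-delimited JSON."""
    lines = []
    for action, payload in pairs:
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    return "\n".join(lines) + "\n"

# Hand-written sample: the same tweet arriving twice with different tags.
tweet = {"text": "example tweet"}
pairs = [
    build_tag_upsert("42", tweet, "tag-a"),
    build_tag_upsert("42", tweet, "tag-b"),
]
print(to_bulk_body(pairs))
```

With this shape the streaming application sends the same kind of request for every tweet, whether or not it has been seen before.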
The second option will increase disk usage and the number of documents, which will have a bad effect on query performance. But it will make the indexing process straightforward.
I do not have clear requirements about retention at this point, and I have not read about time-based indexing yet. I will look into it and get back to you.
Thanks Christian