[Discussion] I have two options for designing my indexes; which one do you think will be better?

Hi guys,

Suppose I want to use Elasticsearch to build a system for analyzing tweets. The tweets will be fetched and tagged before being indexed into ES.

I have two options for designing the indexes for tweets.
Option 1: Create one big index for all tweets and differentiate between the differently tagged tweets using the _type.
Option 2: Create a separate index for each tag.

If I choose the first option, I will end up with one fat index containing all the documents. Since the number of shards is fixed when the index is created and cannot be changed afterwards, I can calculate the number of shards up front according to my cluster size. That means (as I understand it) a lower chance of getting into the Red or Yellow state, and hence less management.
But I worry about query performance, as I think querying one very big index will not be as efficient as querying a small one.
Another issue with this approach is that the same tweet sometimes arrives twice with different tags. If I decide to update the first tweet document, this will complicate the streaming application writing to ES, as I need to get the first tag and then update the document with the new one.

If I choose the second option, I will end up with many indexes (one for each tag), and these indexes will be created dynamically for each new tag. There are two problems with this approach. The first is an ever-increasing number of shards, which increases the chance of the cluster getting into the Red or Yellow state. The other is that we sometimes need to query two or more tags at the same time. If a tweet matches two tags, it will be indexed twice, and the results will not be accurate.

Which approach do you think will be better?

How many tags do you expect to have?

I'd probably use only one index and set the tag as a field. There is no need to have multiple _types for that, especially given that types are on the way out.
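To make the single-index suggestion concrete, here is a minimal sketch of how a search for two or more tags could look once the tag is an ordinary field. The field name `tags` is an assumption for illustration, and the sketch assumes the field is mapped as `keyword`; it only builds the request body for a `terms` query and does not talk to a cluster.

```python
def tags_query(tags, field="tags"):
    """Build a search body matching tweets that carry ANY of the given tags.

    `field` is a hypothetical field name, assumed to be mapped as `keyword`.
    """
    return {
        "query": {
            "bool": {
                # filter context: no scoring needed for an exact tag match
                "filter": [{"terms": {field: sorted(set(tags))}}]
            }
        }
    }


body = tags_query(["president", "north-korea"])
```

Because each tweet is stored once, a tweet that matches both tags shows up once in the results rather than twice.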

How are you going to manage retention of tweets? Have you considered using a time-based indexing scheme?

I have no expectation about the number of tags. This part will be totally dynamic. Today we may get tweets about a visit by the president, and tomorrow we may get tweets about North Korea, and so on.
Note that some tweets could match both tags.

What should I do in this scenario? Should I make the tag field a list and, if a tweet comes again in the stream, update the old document by adding the new tag to the list? Or should I insert a new tweet document with a different ID and a different tag field?

The first option will reduce disk usage and be better for performance (fewer documents to query), but it will complicate the stream, since for each incoming tweet I will have to check whether the same tweet already exists in ES. Considering that millions of tweets are fetched and indexed every day, this will not be good (I think).

The second option will increase disk usage and the number of documents, which will have a bad effect on query performance. But it will make the indexing process straightforward.

What do you think?
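One way the "list of tags" option might avoid a separate existence check is a scripted upsert through the Update API, using the tweet ID as the document `_id`. The sketch below only builds the request body; the field name `tags` and the helper names are assumptions, and `simulate` is a pure-Python stand-in for what the cluster would do, included only to show the intended behaviour.

```python
def tag_upsert_body(tweet, tag):
    """Body for an Update API call addressed to the tweet's own ID.

    If the document is missing, the `upsert` document is indexed as-is;
    if it exists, the Painless script appends the tag when it is not present.
    `tags` is a hypothetical field name.
    """
    return {
        "script": {
            "lang": "painless",
            "source": (
                "if (!ctx._source.tags.contains(params.tag)) "
                "{ ctx._source.tags.add(params.tag) }"
            ),
            "params": {"tag": tag},
        },
        "upsert": {**tweet, "tags": [tag]},
    }


def simulate(store, tweet_id, body):
    """Illustration only: mimic the upsert semantics on a plain dict."""
    if tweet_id not in store:
        doc = dict(body["upsert"])
        doc["tags"] = list(doc["tags"])
        store[tweet_id] = doc
    else:
        tag = body["script"]["params"]["tag"]
        if tag not in store[tweet_id]["tags"]:
            store[tweet_id]["tags"].append(tag)


store = {}
tweet = {"text": "example tweet"}
simulate(store, "42", tag_upsert_body(tweet, "president"))
simulate(store, "42", tag_upsert_body(tweet, "north-korea"))
```

With this shape the streaming application sends the same kind of request whether or not the tweet has been seen before, so the read-then-write step happens inside Elasticsearch rather than in the stream. Whether the update overhead is acceptable at millions of tweets per day would need to be measured.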

I do not have clear requirements about retention at this point, and I have not read about time-based indexing yet. I will look into it and get back to you.
Thanks Christian :)

If you have more than a few tens of tags, then you'll have too many indexes. Maybe use a time-based approach with one index per month and store the tag(s) as a keyword field.
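As a rough sketch of the one-index-per-month idea: the index name can be derived from the tweet's own timestamp, and the tag field mapped as `keyword`. The `tweets-` prefix, the field names, and the helper name are assumptions for illustration, not anything prescribed by Elasticsearch.

```python
from datetime import datetime, timezone


def monthly_index(created_at, prefix="tweets"):
    """Return a per-month index name such as 'tweets-2018.03'.

    Expects a timezone-aware datetime; it is converted to UTC first so the
    same tweet always maps to the same index regardless of the local zone.
    """
    utc = created_at.astimezone(timezone.utc)
    return f"{prefix}-{utc:%Y.%m}"


# Hypothetical mapping properties: tags stored as exact-match keywords.
TWEET_PROPERTIES = {
    "properties": {
        "tags": {"type": "keyword"},
        "text": {"type": "text"},
        "created_at": {"type": "date"},
    }
}

name = monthly_index(datetime(2018, 3, 14, tzinfo=timezone.utc))  # 'tweets-2018.03'
```

Deriving the name from the tweet's creation time rather than the ingestion time means a tweet that arrives a second time still lands in the same index, and retention can be handled by dropping whole monthly indexes.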
