Elasticsearch, always incoming tweets and performance

Romain_Giovanetti · December 9, 2016, 2:31pm

Hello,

For more than two years I've been working on a project that involves continuously collecting and storing tweets.

In the first months the system was incredibly fast, but after that it started getting slower and slower.

My team and I have concluded that the more our index grows, the slower it gets, even for simple queries with very restrictive filters such as a very small range on the tweet's creation date.
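
For illustration only, a date-restricted query of the kind described above might look like the sketch below. The thread does not show the actual queries, so the one-hour window and the helper name `recent_tweets_query` are assumptions; only the `created_at` field (which accepts `epoch_millis`) comes from the mapping.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_tweets_query(end, window_hours=1):
    """Build a search body limited to a short created_at window (hypothetical helper)."""
    start = end - timedelta(hours=window_hours)

    def to_ms(dt):
        # created_at accepts epoch_millis according to the mapping
        return int(dt.timestamp() * 1000)

    return {
        "query": {
            "bool": {
                # filter context: no scoring, only a yes/no match on the date range
                "filter": [
                    {"range": {"created_at": {"gte": to_ms(start), "lt": to_ms(end)}}}
                ]
            }
        }
    }


if __name__ == "__main__":
    body = recent_tweets_query(datetime(2016, 12, 9, 14, 0, tzinfo=timezone.utc))
    print(json.dumps(body, indent=2))
```

Even with such a narrow filter, the request is still sent to every shard of the single index, which is why shard size matters for the observation above.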

Today our cluster is made of 4 computers with a total of 100GB of RAM and 10TB of storage.

Its size is: 764Gi (1.50Ti)
It contains 1 415 348 474 (2 910 677 195) documents

Here are the index settings and mapping:

{  
    "state":"open",
    "settings":{  
        "index":{  
            "creation_date":"1464706655747",
            "legacy":{  
                "routing":{  
                    "hash":{  
                        "type":"org.elasticsearch.cluster.routing.DjbHashFunction"
                    },
                    "use_type":"false"
                }
            },
            "number_of_shards":"12",
            "number_of_replicas":"1",
            "uuid":"zmybcfvkS-uiXCusUGn-xg",
            "version":{  
                "created":"1070599",
                "upgraded":"2040199"
            }
        }
    },
    "mappings":{  
        "status":{  
            "_timestamp":{  

            },
            "properties":{  
                "hashtags":{  
                    "type":"string"
                },
                "created_at":{  
                    "format":"epoch_millis||dateOptionalTime",
                    "type":"date"
                },
                "language":{  
                    "index":"not_analyzed",
                    "type":"string"
                },
                "in_reply":{  
                    "properties":{  
                        "status_id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "user_id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "user_screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "urls":{  
                    "properties":{  
                        "expand_url":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "url":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "nperceptioncnn":{  
                    "type":"double"
                },
                "location":{  
                    "properties":{  
                        "lon":{  
                            "type":"double"
                        },
                        "lat":{  
                            "type":"double"
                        }
                    }
                },
                "text":{  
                    "type":"string"
                },
                "ngender":{  
                    "index":"not_analyzed",
                    "type":"string"
                },
                "nperception":{  
                    "type":"byte"
                },
                "user_mentions":{  
                    "properties":{  
                        "screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "user":{  
                    "properties":{  
                        "screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                }
            }
        }
    },
    "aliases":[  
        "twitter2"
    ]
}

Did we conclude correctly? How can we solve this?
Thank you.

dadoonet · December 9, 2016, 2:49pm

Which version?

So you have 24 shards (12 primaries + 12 replicas).
764 GB is for the primaries, I guess, which means around 64 GB per shard.

We recommend keeping the size per shard under 20 GB.
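
A quick check of these numbers, as a rough sketch. It assumes the 764 figure is the primary data only and that it is spread evenly over the 12 primary shards; the function name `shard_sizing` is made up for this example.

```python
import math


def shard_sizing(primary_size_gb, primary_shards, target_gb):
    """Return (current size per primary shard, primaries needed to stay under target_gb)."""
    per_shard = primary_size_gb / primary_shards
    needed = math.ceil(primary_size_gb / target_gb)
    return per_shard, needed


if __name__ == "__main__":
    per_shard, needed = shard_sizing(764, 12, 20)
    # prints: 63.7 GB per primary shard; 39 primaries needed at 20 GB each
    print(f"{per_shard:.1f} GB per primary shard; {needed} primaries needed at 20 GB each")
```

The index keeps growing as tweets arrive, so any shard count derived this way only holds for the current data size.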

Read also this: https://gibrown.com/2014/02/06/scaling-elasticsearch-part-2-indexing/#comment-1104

@Greg_Brown wrote:

We found that above 20GB our shards started getting slow, and 40GB shards started producing real problems. I think the only way to really understand this limit is to test it with your real data, but maybe you can save yourself some time and try to keep your shards down in the 5-10GB range. I imagine what hardware you are using would also have an effect.

Romain_Giovanetti · December 9, 2016, 3:06pm

We were using ES 1.7 but we moved two weeks ago to ES 2.4.

Your answer raises another question. I've read in the past that a node shouldn't have too many shards, but your solution implies multiplying the shards (and so the Lucene instances) on each node.

It will consume more CPU and more RAM, won't it?

So I can't clearly determine whether it would work, and since reindexing takes many days, I'd rather have the best solution before acting.

Thank you

dadoonet · December 9, 2016, 4:12pm

Maybe it will indeed require more nodes. You can start by adding more nodes, which does not require reindexing, and see if it helps.

But I guess you are already monitoring all that and have not seen excessive CPU or memory usage yet?

BTW, you can also create a new cluster running 5.1.1 and use the reindex from remote feature, which will read all your data from your 2.4 cluster and reindex it in the 5.1 cluster. Change the number of shards and see how it goes.
Reindex from remote allows you to run a query to extract only a subset of your data, so you can try it on a smaller subset first.
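
As a rough sketch of what that could look like, the snippet below only builds the two JSON request bodies (index creation and `_reindex`) and sends nothing. The host `http://old-cluster:9200`, the destination index name `twitter3`, the shard count of 40 and the one-day window are all assumptions for illustration; `twitter2` and `created_at` come from the settings posted above. The old cluster's address also has to be listed in `reindex.remote.whitelist` in the new cluster's `elasticsearch.yml`.

```python
import json
from datetime import datetime, timezone


def new_index_body(number_of_shards, number_of_replicas=1):
    """Body for creating the destination index with a different shard count."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def reindex_from_remote_body(remote_host, source_index, dest_index, start, end):
    """Body for POST _reindex on the new cluster, pulling a date-limited subset."""
    def to_ms(dt):
        return int(dt.timestamp() * 1000)

    return {
        "source": {
            "remote": {"host": remote_host},  # the 2.4 cluster (hypothetical address)
            "index": source_index,
            # only copy documents created inside [start, end)
            "query": {"range": {"created_at": {"gte": to_ms(start), "lt": to_ms(end)}}},
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    day_start = datetime(2016, 12, 1, tzinfo=timezone.utc)
    day_end = datetime(2016, 12, 2, tzinfo=timezone.utc)
    print("PUT twitter3")
    print(json.dumps(new_index_body(40), indent=2))
    print("POST _reindex")
    print(json.dumps(
        reindex_from_remote_body("http://old-cluster:9200", "twitter2", "twitter3",
                                 day_start, day_end),
        indent=2))
```

Running the same queries against the small test index and the old one gives a first impression of the effect of the new shard layout before committing to a full reindex.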

