Elasticsearch, always incoming tweets and performance


(Romain Giovanetti) #1

Hello,

For more than two years I've been working on a project that implies to continually collect and store tweets.

In the earlier months the system was incredibly fast and after that time, it started to getting slower and slower...

My team and I have concluded that, even if our queries have very limiting rules such as a very small range on the tweet's creation date, the more our index grows, the slower it gets, even for simple queries.

Today our cluster is made of 4 computers with a total of 100GB of RAM and 10TB of storage.

It's size is: 764Gi (1.50Ti)
It counts 1 415 348 474 (2 910 677 195) documents

Here is the mapping:

{  
    "state":"open",
    "settings":{  
        "index":{  
            "creation_date":"1464706655747",
            "legacy":{  
                "routing":{  
                    "hash":{  
                        "type":"org.elasticsearch.cluster.routing.DjbHashFunction"
                    },
                    "use_type":"false"
                }
            },
            "number_of_shards":"12",
            "number_of_replicas":"1",
            "uuid":"zmybcfvkS-uiXCusUGn-xg",
            "version":{  
                "created":"1070599",
                "upgraded":"2040199"
            }
        }
    },
    "mappings":{  
        "status":{  
            "_timestamp":{  

            },
            "properties":{  
                "hashtags":{  
                    "type":"string"
                },
                "created_at":{  
                    "format":"epoch_millis||dateOptionalTime",
                    "type":"date"
                },
                "language":{  
                    "index":"not_analyzed",
                    "type":"string"
                },
                "in_reply":{  
                    "properties":{  
                        "status_id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "user_id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "user_screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "urls":{  
                    "properties":{  
                        "expand_url":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "url":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "nperceptioncnn":{  
                    "type":"double"
                },
                "location":{  
                    "properties":{  
                        "lon":{  
                            "type":"double"
                        },
                        "lat":{  
                            "type":"double"
                        }
                    }
                },
                "text":{  
                    "type":"string"
                },
                "ngender":{  
                    "index":"not_analyzed",
                    "type":"string"
                },
                "nperception":{  
                    "type":"byte"
                },
                "user_mentions":{  
                    "properties":{  
                        "screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                },
                "user":{  
                        "screen_name":{  
                            "index":"not_analyzed",
                            "type":"string"
                        },
                        "id":{  
                            "index":"not_analyzed",
                            "type":"string"
                        }
                    }
                }
            }
        }
    },
    "aliases":[  
        "twitter2"
    ]
}

Did we concluded well? How can we solve that?
Thank you.


(David Pilato) #2

Which version?

So you have 24 shards (12 primaries + 12 replicas).
764 gb I guess for primaries. Which means around 38 gb per shard.

We recommend keeping the size per shard under 20 gb.

Read also this: https://gibrown.com/2014/02/06/scaling-elasticsearch-part-2-indexing/#comment-1104

@Greg_Brown wrote:

We found that above 20GB our shards started getting slow, and 40GB shards started producing real problems. I think the only way to really understand this limit is to test it with your real data, but maybe you can save yourself some time and try to keep your shards down in the 5-10GB range. I imagine what hardware you are using would also have an effect.


(Romain Giovanetti) #3

We were using ES 1.7 but we moved two weeks ago to ES 2.4.

Your answser brings me another question, I've read in the past that a node shouldn't have too many shards but your solution implies to multiply the shards (so the lucene instances) on each node.

It will consume more CPU and more RAM, isn't it?

So I can't clearly determine if it would work and reindexing takes a lot of days so I'd better have the best solution before acting.

Thank you


(David Pilato) #4

May be it will require more nodes indeed. At first you can start to add more nodes without needing to reindex and see if it helps.

But I guess you are already monitoring all that and you did not find too much CPU/Memory usage yet?

BTW, you can also create a new cluster running 5.1.1 and use the reindex from remote feature which will read all your data from your 2.4 cluster and reindex in the 5.1 cluster. Change the number of shards and see how it goes.
Reindex from remote allows you to run a query to only extract a subset of your data so you can do it on a smaller subset.


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.