I am creating an online shopping website. Users can create advertisements on the website to sell their products... these advertisements expire or get sold after some time...
I keep all advertisements in MySQL and copy the advertisement table into Elasticsearch. I want to create a tool as described here, to keep ES in sync.
Once advertisements expire, I delete them from ES... so eventually my ES index will contain a lot of deleted documents.
Is it a good idea to delete the whole advertisement index in ES periodically and recreate it with only the active advertisements? Or does the merging process in ES take care of deleted documents, so that I don't need to worry about deleting and recreating the whole index periodically?
Elasticsearch takes care of deleted documents and eventually removes them from disk while merging segments.
So it's ok in general to remove some documents.
But if you are planning to remove a huge number of documents, say more than 50-80% of your dataset in one go, then I think it's better to reindex the whole dataset.
You can also use time-based indices and index documents by their expiration date in dedicated indices, like index-2018-04 and index-2018-05 for documents expiring in April and May.
You can filter out documents you don't want to see anymore with a date range filter. After a month (let's say in May), you can just delete the index index-2018-04. That would be efficient IMO.
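As a rough sketch of the idea above: assuming each advertisement carries an expiration date (the field name expires_at and the helper names below are made up for illustration), the monthly index name can be derived from that date, and expired documents can be hidden at search time with a range filter. This only builds the index name and the search body; sending them to Elasticsearch is left out.

```python
from datetime import date


def index_for(expires_at: date, prefix: str = "index") -> str:
    """Name of the monthly index for documents expiring in that month."""
    return f"{prefix}-{expires_at:%Y-%m}"


def active_ads_query(today: date, field: str = "expires_at") -> dict:
    """Search body keeping only documents that expire today or later.

    The field name is an assumption, not something from the thread.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {field: {"gte": today.isoformat()}}}
                ]
            }
        }
    }


print(index_for(date(2018, 4, 6)))
print(active_ads_query(date(2018, 4, 6)))
```

A filter clause is used rather than a scoring query because the date check should only include or exclude documents, not influence relevance.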
So all advertisements will expire after a month or so... I would probably be deleting less than 1% of the index on each bulk delete request... but after around 1 month, 100% of the index will have been deleted, as all advertisements eventually expire. From your explanation, it sounds like I should be fine with just deleting them as they expire (probably combining a couple of deletes in one bulk delete)? Is my assumption correct that ES merges will happen on a regular basis (say every day) and therefore this approach could work?
I really like your second solution. The problem is I would always have to search the two most recent indices. For example, an advertisement created on the 29th of March is still active on the 6th of April, so I need to search both the March and April indices. Do you think this would work?
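For what it's worth, combining several deletes into one bulk request could look roughly like the sketch below. It only builds the newline-delimited body that the _bulk endpoint expects (one delete action per line, with a trailing newline); the index name "advertisements" and the function name are assumptions, and older Elasticsearch versions may also need a _type in each action.

```python
import json


def bulk_delete_body(index: str, ids) -> str:
    """Build an NDJSON body for the _bulk API with one delete action per id."""
    lines = [
        json.dumps({"delete": {"_index": index, "_id": str(doc_id)}})
        for doc_id in ids
    ]
    if not lines:
        return ""
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical ids of advertisements that just expired.
body = bulk_delete_body("advertisements", [101, 102, 103])
print(body, end="")
```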
We have implemented Elasticsearch on a SaaS with millions of users who are active approximately 5-6 hours a day. We have multiple docs in one index and we delete individual documents, and we are not facing any problems. But whenever we end up in a situation where we need to recreate the index, the process takes a long time, because we have to fetch a large amount of data to put into Elasticsearch. So my suggestion is, please consider deleting documents.
But if you have a single doc in a single index, then maybe you can consider deleting the index.
the problem is I would always have to search the two most recent indices.
Even though you are probably not going to implement that, let me add some comments.
Not really an issue. Just add each monthly index to an alias when you create it.
mydata -> index-2018-04
mydata -> index-2018-05
Then you just have to search on mydata in your application. Elasticsearch will run the search on all indices.
Then if you delete index-2018-04, it won't be searched anymore.
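To illustrate the alias setup described above, here is a minimal sketch that builds a request body for the _aliases endpoint, adding both monthly indices to the mydata alias. The helper name is made up, and actually sending the body (with whatever client you use) is not shown.

```python
import json


def add_to_alias(alias: str, indices) -> dict:
    """Body for POST /_aliases that points `alias` at each given index."""
    return {
        "actions": [
            {"add": {"index": name, "alias": alias}} for name in indices
        ]
    }


body = add_to_alias("mydata", ["index-2018-04", "index-2018-05"])
print(json.dumps(body, indent=2))
```

The application then searches mydata only, so creating or deleting a monthly index does not require any change on the application side.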