We are using Elasticsearch for our recommendations system.
Every index is a product catalog that can hold between 10K and 20M documents.
Each document represents one product.
A product document contains:
- Metadata such as: product id, price, in stock, categories (array) etc.
- Statistical data such as: popularity score, most-viewed-together (array of other product ids) etc.
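To make the shape concrete, here is a minimal sketch of what such a document might look like. All field names and values are hypothetical and chosen only for illustration; the real mapping may differ.

```python
# Hypothetical product document; every field name and value is illustrative.
product_doc = {
    # Metadata
    "product_id": "p-1001",
    "price": 19.99,
    "in_stock": True,
    "categories": ["X", "Y", "Z"],
    # Statistical data
    "popularity_score": 0.87,
    "most_viewed_together": ["p-1002", "p-1047"],
}

# The two groups of fields described above (names are assumptions).
METADATA_FIELDS = {"product_id", "price", "in_stock", "categories"}
STATISTICS_FIELDS = {"popularity_score", "most_viewed_together"}
```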
- What are the most popular products (by their popularity score) in categories X, Y and Z that are in stock?
- To keep popularity fair, when we update popularity scores we want to update the scores of all products, since popularity is relative across all products.
- A popularity update can take a long time. Updating 20M documents can take dozens of minutes.
- Updating metadata should be quick (a few minutes at most). If a price changes, we must apply the new price ASAP.
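Against a single combined index, the "most popular in-stock products in categories X, Y and Z" question could be sketched as the search body below. This assumes hypothetical field names (`categories` mapped as a keyword field, `in_stock` as a boolean, `popularity_score` as a number) and reads "categories X, Y and Z" as "in all three categories"; if "any of them" is meant, a single `terms` filter would replace the per-category `term` filters.

```python
def top_popular_query(categories, size=10):
    """Build a search body: in-stock products in ALL given categories,
    sorted by popularity score, highest first."""
    # One `term` filter per category means a product must match every one.
    filters = [{"term": {"in_stock": True}}]
    filters += [{"term": {"categories": c}} for c in categories]
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        "sort": [{"popularity_score": {"order": "desc"}}],
    }


body = top_popular_query(["X", "Y", "Z"], size=5)
```

The resulting dict would be sent as the JSON body of a `_search` request. Filters are used instead of scored clauses because relevance scoring is not needed here; the ordering comes entirely from the sort.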
Today we create a new index each time metadata changes: we load the fresh, updated catalog, calculate the popularity score for each product, and replace the old index.
This is a long process that doesn’t meet the few-minutes metadata update requirement.
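One common way to do the "replace the old index" step is to have clients search through an alias and repoint it once the new index is ready. The post does not say how the replacement is done, so the alias and index names below are purely hypothetical; the sketch only builds the request body for the `_aliases` API.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build an `_aliases` request body that moves `alias` from the
    old index to the new one in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: searches go to "catalog", which points at a dated index.
swap = alias_swap_actions("catalog", "catalog_2024_01_01", "catalog_2024_01_02")
```

Both actions are applied together, so searches against the alias never see a moment with no index behind it. The swap itself is quick; the slow part remains rebuilding and rescoring the full catalog.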
We are thinking of separating the catalog into two different indices: metadata and statistics.
Metadata could then be updated in place on the existing metadata index (applying only the differences / deltas), while a new statistics index could be calculated twice a day and replace the old one.
This way metadata is updated quickly and the fairness of popularity is preserved.
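The delta updates to the metadata index could be sent as partial updates through the `_bulk` API. Below is a minimal sketch that only builds the NDJSON payload; it assumes the document `_id` equals the product id, and the index name and field names are hypothetical.

```python
import json


def bulk_update_lines(index, deltas):
    """Build an NDJSON `_bulk` payload of partial updates.

    `deltas` maps a product id to a dict holding only the changed fields.
    """
    lines = []
    for product_id, changed in deltas.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": product_id}}))
        # Source line: a partial document merged into the existing one.
        lines.append(json.dumps({"doc": changed}))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


payload = bulk_update_lines(
    "catalog_metadata",  # hypothetical index name
    {"p-1001": {"price": 17.49}, "p-1002": {"in_stock": False}},
)
```

Only the changed fields travel over the wire, so a price change touches one document instead of triggering a full rebuild.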
How does that sound?
However, it creates a new issue on the querying / serving side.
How do I query the products in categories X, Y and Z (from the metadata index) with the highest popularity score (from the statistics index)?