I've taken over a price comparison website that is using inefficient MySQL queries for the search functionality. It barely works.
I know that I want to use Elastic for the search functionality but I'm finding it near impossible to find my use case as a reference when building the index. My background is in eCommerce working with product data, whereas this scenario is different as the parent/child relationship is at a different level. The application is a product discovery website that offers price comparison across a range of retailers. Similar to the likes of PriceRunner and PriceSpy. The data comes from external sources initially (CSV, XML, API) which is imported to a database. That's then served to the end user to compare and links through to the retailer to buy.
Here's some notes on my scenario:
- The product and retailer data is held in a MySQL database, where the data will be regularly indexed to Elastic after being normalised.
- There'll be a "parent" product which contains all of the key product information that doesn't relate to a retailer (name, description, category, image). This will be based on the most recently updated data from retailers so isn't fixed data and will update when changes are made in the original data.
- The products have an EAN/GTIN field which is the same across retailers. For example, a product with the code 1234356 could be sold in Shop A, Shop B and Shop C. The data from the retailer is my "child" data which will contain URL, price, etc...
- The traditional ecommerce child/parent relationship for things like size or colour will be separated into the "parent" products so this doesn't require any consideration.
- The data will be updated every evening with any changes from the retailers.
- The database will contain around 2 million products, with about 1 million of those being "parent" products but will grow quickly once the application is live.
- The retailer needs to be filterable as a term, as well as the usual parent information like colour/size/category.
I'm trying to figure out the best way to build the index to ensure it is best practice, fast and scalable to 100 million "child" products. I know I need to normalise the data as much as possible but have explored using nested fields too for easier updates. I've also looked at using collapse but the application needs to know the number of results in each "parent" and I can't see a way to get that at query runtime. What would you propose in this situation?