Hi,
We're struggling with modeling our data in Elasticsearch, and decided to change it.
What we have today: a single index that stores product data of 2 types -
[1] Product data that changes rarely -
* name, category, URL, product attributes (e.g. color, price), etc.
[2] Product data that might change frequently for past documents, and is indexed daily - [KPIs]
* product-family, daily sales, daily price, daily views...
Our requirements are -
- Store product-related data (for millions of products)
- Index KPIs on a daily level, and store those KPIs for a period of 2 years.
- Update "product-family" on a daily level, for thousands of products. (no need to index it daily)
- Query and aggregate the data with low latency, to display it in our UI. Aggregation examples -
- Sum all product sales in the last 3 months, from category 'A', and sort by total sales.
- Same as the above, but in addition aggregate on the product-family field.
- Keep efficient indexing rate.
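For concreteness, the first aggregation example would look roughly like this against our current daily index (a sketch only - the field names `category`, `date`, `product_id` and `daily_sales` are illustrative, not our exact mapping):

```python
# Sketch of "sum all product sales in the last 3 months, from category 'A',
# sorted by total sales" against the current flat daily index.
# Field names (category, date, product_id, daily_sales) are illustrative.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"category": "A"}},                # type [1] filter
                {"range": {"date": {"gte": "now-3M/d"}}},   # last 3 months
            ]
        }
    },
    "aggs": {
        "by_product": {
            "terms": {
                "field": "product_id",
                "order": {"total_sales": "desc"},  # sort buckets by the sum
            },
            "aggs": {
                "total_sales": {"sum": {"field": "daily_sales"}}
            },
        }
    },
}
```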
Currently, we're storing everything in the same index, daily, which means we store repetitive data such as name, category and URL over and over again. This approach is very problematic for multiple reasons -
- We're holding duplicates of data of type [1], which hardly changes, and this causes the index to be very large.
- When data of type [2] changes, specifically the product-family field (this happens daily), it requires updating tens of millions of documents (some from more than a year ago), which makes the system very slow and causes queries to time out.
Splitting this data into 2 different indices won't work for us, since we have to filter data of type [2] by data of type [1] (e.g. all sales from category 'A'). Moreover, we'd have to join that data somehow, and our backend server won't handle that load.
We're not sure how to model this data properly, our thoughts are -
- Using parent-child relations - parent is product data of type [1] and children are KPIs of type [2]
- Using nested fields to store KPIs (data of type [2]).
Both of these methods would let us reduce the current index size by eliminating the duplicated data of type [1], and update data of type [2] efficiently even for very old documents.
Specifically, both methods allow us to store product-family once per product - in the parent document, or in the non-nested fields - which means we only have to update a single document per product. (These updates are daily.)
We think the parent-child relation is more suitable, because we're adding KPIs daily, and per our understanding, appending a new KPI to a nested field would re-index the whole document.
On the other hand, we're afraid that parent-child relations will increase query latency dramatically, which would make our UI very slow.
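The kind of query we're worried about is the cross-type filter, which with parent-child would go through has_parent - something like the sketch below (assuming a join relation with parent type "product"; field names are illustrative):

```python
# Sketch: with a join field (parent type "product", child type "kpi"),
# child KPI documents can be filtered by parent fields via has_parent.
# This per-query join is the part whose latency worries us.
# Relation and field names are illustrative, not our real mapping.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"date": {"gte": "now-3M/d"}}},   # last 3 months
                {
                    "has_parent": {
                        "parent_type": "product",
                        "query": {"term": {"category": "A"}},  # parent filter
                    }
                },
            ]
        }
    },
    "aggs": {
        "by_product": {
            "terms": {
                "field": "product_id",
                "order": {"total_sales": "desc"},
            },
            "aggs": {"total_sales": {"sum": {"field": "daily_sales"}}},
        }
    },
}
```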
We're not sure what the proper way to model this data is, or whether our ideas are on the right path. We'd appreciate any help, since we've been struggling with this for a long time.