Hi. I'm pretty new to Elasticsearch and I'm wondering if anyone could give me some advice on the following two use cases.
Use Case 1: Storing catalog data
I've got a few thousand product catalogs, each containing between roughly 10,000 and 3-4 million products (50,000 on average). Each product document is roughly 2-3 kilobytes, most of which is taken up by two properties (the description and a map of customer-defined properties). All other properties (around 25) are typically fixed-size (e.g. integers or dates) or fairly short (like titles).
I run queries and operations such as the following on the catalogs (always on a single catalog, never across several of them!):
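To make these numbers a bit more concrete, here is my rough back-of-the-envelope calculation of the raw data volume. The catalog count of 3,000 is only my assumed stand-in for "a few thousand", and the result is the raw document size, not the on-disk index size (which also depends on mappings and replicas).

```python
# Rough sizing sketch based on the estimates above.
# CATALOGS = 3_000 is an ASSUMPTION standing in for "a few thousand".
CATALOGS = 3_000
AVG_PRODUCTS_PER_CATALOG = 50_000
DOC_SIZE_KB = (2, 3)  # lower and upper estimate per document

total_docs = CATALOGS * AVG_PRODUCTS_PER_CATALOG


def raw_size_gb(doc_kb: float) -> float:
    """Raw source size in (decimal) gigabytes for a given document size in KB."""
    return total_docs * doc_kb / 1_000_000


low_gb, high_gb = (raw_size_gb(kb) for kb in DOC_SIZE_KB)

print(f"documents: {total_docs:,}")
print(f"raw size:  {low_gb:.0f}-{high_gb:.0f} GB")
```

So under that assumption I'm looking at roughly 150 million documents and a few hundred gigabytes of raw data.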
- list all categories (each product is assigned to many of them), list all categories matching a full text query
- list all tags, list all tags matching a full text query
- there are a few other properties with the same search characteristics as categories and tags
- list all products matching a few given tags, categories, etc.
- faceted search: narrow the results of the queries above based on the currently selected category, tag, etc.
- I perform many thousands of multi-get requests by ID each day (typically every document of a catalog is fetched at least four to five times a day).
- I need the IDs of all documents changed since a given timestamp at least 2-3 times per day.
- Each document is updated once a day (sometimes multiple times a day, but this is rare).
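To illustrate the kind of requests I mean, here is a minimal sketch of two search bodies built as plain Python dicts, using the bool/filter form of the query DSL found in recent Elasticsearch versions. The field names (`catalog_id`, `tags`, `categories`, `updated_at`) are hypothetical and assumed to be mapped as exact-value fields. The `catalog_id` filter would only be needed if several catalogs shared one index, and I'm assuming "matching a few given tags" means a product must carry all of them.

```python
import json


def faceted_search_body(catalog_id, tags=(), categories=(), size=20):
    """Products of one catalog carrying ALL given tags/categories, plus facet counts.

    Field names are hypothetical; adjust to the real mapping.
    """
    filters = [{"term": {"catalog_id": catalog_id}}]
    filters += [{"term": {"tags": t}} for t in tags]
    filters += [{"term": {"categories": c}} for c in categories]
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        # Terms aggregations give the facet counts for the filtered result set.
        "aggs": {
            "by_category": {"terms": {"field": "categories"}},
            "by_tag": {"terms": {"field": "tags"}},
        },
    }


def changed_since_body(catalog_id, since, size=1000):
    """IDs only (no _source) of documents updated after the given timestamp."""
    return {
        "size": size,
        "_source": False,
        "query": {"bool": {"filter": [
            {"term": {"catalog_id": catalog_id}},
            {"range": {"updated_at": {"gt": since}}},
        ]}},
    }


print(json.dumps(faceted_search_body("cat-1", tags=["red", "sale"]), indent=2))
print(json.dumps(changed_since_body("cat-1", "2014-01-01T00:00:00Z"), indent=2))
```

Both bodies use only filters, since none of these requests need relevance scoring.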
The number of catalogs is growing constantly (linearly). I'm wondering how I should structure my indices for such a use case. One large index, or one per catalog? How many primary and replica shards do I need for each index in order to handle my throughput (write and read)? Which machines do I need, and how many? Since I don't really run statistical queries (at the moment), I guess RAM is not my main concern? Nevertheless, how much do I need? I know that no one can give me a 100% answer, so a rough estimate would help!
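For the throughput part, this is roughly how I estimate the average load. Again, 3,000 catalogs is only my assumption for "a few thousand", and the sketch assumes the load is spread evenly over the day, which it probably isn't, so peaks will be higher.

```python
# Average-load sketch; peak load is NOT modelled here.
# CATALOGS = 3_000 is an ASSUMPTION standing in for "a few thousand".
CATALOGS = 3_000
AVG_PRODUCTS_PER_CATALOG = 50_000
FETCHES_PER_DOC_PER_DAY = 5   # upper end of "four to five times a day"
UPDATES_PER_DOC_PER_DAY = 1   # "each document is updated once a day"
SECONDS_PER_DAY = 24 * 60 * 60

total_docs = CATALOGS * AVG_PRODUCTS_PER_CATALOG

# Documents fetched via multi-get, and documents re-indexed, per second on average.
fetched_docs_per_sec = total_docs * FETCHES_PER_DOC_PER_DAY / SECONDS_PER_DAY
updated_docs_per_sec = total_docs * UPDATES_PER_DOC_PER_DAY / SECONDS_PER_DAY

print(f"fetched docs/s (avg): {fetched_docs_per_sec:,.0f}")
print(f"updated docs/s (avg): {updated_docs_per_sec:,.0f}")
```

That works out to several thousand documents fetched and somewhat under two thousand documents updated per second on average.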
Use Case 2: Storing statistical data
I've got another dataset (A), closely related to the first one, where I store statistical data about products. Basically, I've got one row for each product of each catalog, containing around 25 daily calculated numbers. I've also got a second dataset (B), where a few rows (fewer than 5) are added for each product on a daily basis.
I need to query dataset A (again, always for a single catalog, never across multiple ones) and get the top n results (up to around 1,000-2,000) ordered by any of the precalculated numbers.
I need to query dataset B (again - always for a single catalog, never across multiple ones) and get all rows for a certain product for a given time range.
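The two queries on datasets A and B could look roughly like this as search bodies. This is only a sketch: the field names (`catalog_id`, `product_id`, `date`) and the metric name `views` in the usage lines are hypothetical, and the size limits are placeholders I picked.

```python
import json


def top_n_body(catalog_id, metric, n=1000):
    """Dataset A: top n rows of one catalog, ordered by one precalculated number."""
    return {
        "size": n,
        "query": {"bool": {"filter": [{"term": {"catalog_id": catalog_id}}]}},
        "sort": [{metric: {"order": "desc"}}],
    }


def product_rows_body(catalog_id, product_id, start, end, size=1000):
    """Dataset B: all rows of one product within [start, end], oldest first."""
    return {
        "size": size,
        "query": {"bool": {"filter": [
            {"term": {"catalog_id": catalog_id}},
            {"term": {"product_id": product_id}},
            {"range": {"date": {"gte": start, "lte": end}}},
        ]}},
        "sort": [{"date": {"order": "asc"}}],
    }


print(json.dumps(top_n_body("cat-1", "views", n=2000), indent=2))
print(json.dumps(product_rows_body("cat-1", "p-42", "2014-01-01", "2014-01-31"), indent=2))
```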
Can anyone advise me on how to structure this use case? Can or should I use the same cluster for both use cases?
Thank you very much!
Peter