Advice on Index and Cluster Structure?

Hi. I'm pretty new to Elasticsearch and I'm wondering if anyone could give me some advice on the following two use cases.

Use Case 1: Storing catalog data

I've got a few thousand product catalogs, each of them containing from around 10,000 up to around 3-4 million products (50,000 on average). Each product document is roughly 2-3 kilobytes, where the majority of the size is taken up by 2 properties (the description and a map of customer-defined properties) - all other properties (around 25) are typically fixed size (e.g. integers or dates) or pretty short (like titles).

I'm performing e.g. the following queries on the catalogs (always on a single catalog, never across many of them!):

  1. list all categories (each product is assigned to many of them), list all categories matching a full text query
  2. list all tags, list all tags matching a full text query
  3. there are a few other properties with the same search characteristics as 1,2
  4. list all products matching a few given tags, categories, etc...
  5. faceted search - reduce the results of 1-4 based on currently selected category, tag etc...
  6. I'm performing many thousands of multi-gets by ID each day (typically every document of a catalog is fetched at least four to five times a day).
  7. I need the IDs of all documents changed since a given timestamp, at least 2-3 times per day.
  8. Each document is updated once a day (sometimes multiple times a day, but that is a rare case).

The number of catalogs is constantly growing (linearly). I'm wondering how I should structure my indices for such a use case. One large index, or one per catalog? How many primary / secondary shards do I need for each of the indices in order to handle my throughput (write and read)? Which / how many machines do I need? Since I don't really perform statistical queries (at the moment), I guess RAM is not my main issue? Nevertheless - how much do I need? I know that no one can give me a 100% answer - a rough estimate helps!

Use Case 2 - Storing statistical data

I've got another dataset (A), closely related to the first one, where I store statistical data about products. Basically I've got one row for each product of each catalog containing some daily calculated numbers (around 25), and I've got another dataset (B) where a few rows (< 5) are added for each product on a daily basis.

I need to query dataset A (again - always for a single catalog, never across multiple ones) and get the top n results (up to around 1,000 - 2,000) ordered by any of the precalculated numbers.

I need to query dataset B (again - always for a single catalog, never across multiple ones) and get all rows for a certain product in a given time range.

Can anyone advise me on how to structure this use case? Can / should I use the same cluster for both use cases?

Thank you very much!
Peter

I'll go through things sort of stream of consciousness as I read them.

Ok. You can't have one Elasticsearch index per catalog. You'll need to dump a bunch of them into one index. Maybe the biggest ones can have their own index, but the others need to share. Elasticsearch just doesn't efficiently support many thousands of indexes in the cluster. Hundreds, sure, but not thousands.

You mean an object with arbitrary stuff set by the customers, right? Not actual GIS coordinates for slices of land. I mean, we have support for both, but both bring their own considerations.

Arbitrary properties are a problem because they introduce sparsity in some data structures. They also allow your mapping to grow without bounds. Maybe it'd be OK in some cases, but I've seen it lead to all kinds of trouble. I suggest setting "dynamic": "false" on that object so that all the arbitrary stuff is stored but not indexed. You can manually add fields to the mapping to make some of the arbitrary stuff searchable if you have to. You'd have to "touch" every document with that field, but on indexes with ~10 million records you can get away with running inefficient batch operations every once in a while.
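For illustration, a mapping along these lines would do it - the index name and field names here are made up, and the exact syntax depends on your Elasticsearch version (older versions want a type name in the path):

    PUT products
    {
      "mappings": {
        "properties": {
          "catalog_id":   { "type": "keyword" },
          "title":        { "type": "text" },
          "description":  { "type": "text" },
          "updated_at":   { "type": "date" },
          "custom_props": {
            "type": "object",
            "dynamic": false
          }
        }
      }
    }

Whatever lands under custom_props still comes back in _source, it just never creates new fields in the mapping.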

Elasticsearch doesn't really care either way. If you wanted to run one-off queries for your own use across all of them it'd do just fine. It'd be slower because of the massive fan-out, but it'd be ok.

Doing this almost implies that you have an index to store categories. You could do that: just a single index in your whole system, with documents that look like {"catalog": 1235, "category": "cats"}. Doing the same thing against an index of products is going to be inefficient. You can do it with aggregations, but I'd think twice if this is something you need to be super fast. Aggregations have to load the category from column-stride storage and turn it into buckets and stuff. If you had an index for categories you'd just be searching.
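Here's a sketch of what I mean, with made-up names (the _doc endpoint and mapping syntax vary a bit by version) - one tiny document per catalog/category pair, so "list all categories matching a full text query" is an ordinary search:

    PUT categories
    {
      "mappings": {
        "properties": {
          "catalog":  { "type": "keyword" },
          "category": { "type": "text" }
        }
      }
    }

    PUT categories/_doc/1235-cats
    { "catalog": "1235", "category": "cats" }

    GET categories/_search
    {
      "query": {
        "bool": {
          "filter": { "term": { "catalog": "1235" } },
          "must":   { "match": { "category": "cat" } }
        }
      }
    }

The aggregation alternative would be a terms aggregation on a category field of the product index, filtered down to the catalog - it works, it just does more work per request.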

Same as above, but if you mean the user-defined tags then even more so, because I'm loath to index them by default.

A couple of wrenches in this.

  1. Do you mean all? Or do you mean "the most relevant"? Elasticsearch isn't really built to send back all matches when the total is more than about 10,000; at that point you switch to the scroll API, etc. If you just want to tell the user "there are 431,143 matches, these are the top 10", that is fine. It can be the top 20 or the top 100 (see the sketch after this list).
  2. Do you mean the user-defined tags again? That can be trouble, like I described above.
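Here's the sketch mentioned in point 1 - a plain top-N search against a hypothetical shared product index (field names invented), filtered to one catalog; if you genuinely need every match you'd switch to the scroll API instead of raising size:

    GET products/_search
    {
      "size": 20,
      "query": {
        "bool": {
          "filter": [
            { "term": { "catalog_id": "1235" } },
            { "term": { "tags": "summer" } },
            { "term": { "category": "cats" } }
          ]
        }
      }
    }

The response carries the total hit count alongside the top 20, so "there are 431,143 matches, here are the top 20" is one query (on newer versions you may need track_total_hits to get an exact count past 10,000).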

Fine, so long as you aren't talking about arbitrary user-defined tags.

The inefficient part of this is that documents are stored bundled with other documents and compressed. You have to decompress the bundle up to the point where your document ends. For lots of use cases this is fine. It probably is for yours, but you should understand that it isn't a simple "read the bits off the disk" operation.
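For what it's worth, the multi-get itself is simple (index name and IDs hypothetical):

    GET products/_mget
    { "ids": [ "42", "43", "44" ] }

Per-document _source filtering can trim the response size if you only need a couple of fields, but the compressed block still has to be read and decompressed either way.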

If you put the change timestamp in the document (i.e. as a field you write) then you can just do this with a range query. It should work fine.
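Something like this, assuming you write an updated_at date field on every change (names invented); with "_source": false you get back just the IDs:

    GET products/_search
    {
      "size": 1000,
      "_source": false,
      "query": {
        "bool": {
          "filter": [
            { "term":  { "catalog_id": "1235" } },
            { "range": { "updated_at": { "gte": "2016-03-01T00:00:00Z" } } }
          ]
        }
      }
    }

If more than ~10,000 documents can change between two polls, page through with the scroll API rather than bumping size.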

Updates are fairly expensive. You should figure out how to turn those updates into noops where possible; Elasticsearch has support for detecting noops in the update action. If it weren't for this requirement I'd say the number of machines you need is really based on the size of the working set for search. Because of this requirement you'll have to test with the constant updates too.
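A sketch of that with a made-up document (the path shown is the newer typeless form; older versions put a type in there). detect_noop defaults to true for partial doc updates, so an update whose fields already match the stored document is skipped instead of reindexed:

    POST products/_update/42
    {
      "doc": { "price": 1999, "stock": 12 },
      "detect_noop": true
    }

One gotcha: if you also write a fresh "last updated" timestamp on every run, nothing will ever be a noop - only bump the timestamp when something else actually changed.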


A fairly standard way of doing it is to put big catalogs (millions of products) in their own indexes and keep small catalogs together. Just group them based on the date they were created or something. Once an index gets to around the 10GB mark you should usually think about spinning up a new index for new catalogs.
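If you go that route, one optional convenience (not required, just an Elasticsearch feature that happens to fit) is a filtered alias per catalog, so the application can still address each catalog by name even though small ones share a physical index (names made up):

    PUT catalogs-2016-q1

    POST _aliases
    {
      "actions": [
        {
          "add": {
            "index": "catalogs-2016-q1",
            "alias": "catalog-1235",
            "filter": { "term": { "catalog_id": "1235" } }
          }
        }
      ]
    }

Searches against catalog-1235 then only ever see that catalog's documents.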

s/secondary/replica/

Replica shards don't help with write throughput. They hurt, because the writes are replayed verbatim on those shards. They do help with read throughput. The only way to figure this out is to test it. You'll want to test under the expected write load because you are looking at ~5,800 writes a second. That isn't trivial, especially because those are updates, not plain writes. You should experiment with that. You can absolutely experiment with things like having multiple clusters.

Usually folks use 1 or 2 replicas. More for more paranoid use cases, or if they have a very, very high read load compared to their write load.
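That's an index setting you can change at any time on a live index:

    PUT products/_settings
    { "index": { "number_of_replicas": 1 } }

The number of primary shards, on the other hand, is fixed when the index is created (short of reindexing), so that's the one worth getting roughly right up front.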

Make sure you use _bulk for those updates - that's very important if you want to get decent throughput. Also look at the refresh_interval setting.
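Roughly like this - the _bulk body is newline-delimited JSON, one action line plus one payload line per update, and the index name and IDs here are placeholders (older versions also need a type in the action line):

    POST _bulk
    { "update": { "_index": "products", "_id": "42" } }
    { "doc": { "price": 1999 }, "detect_noop": true }
    { "update": { "_index": "products", "_id": "43" } }
    { "doc": { "price": 2499 }, "detect_noop": true }

    PUT products/_settings
    { "index": { "refresh_interval": "30s" } }

A few thousand documents per bulk request is a sane starting point to tune from, and a longer refresh_interval (the default is 1s) takes pressure off while the daily update run is going; you can set it back afterwards.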

You'll have to experiment. Because you are touching every product every day, you can't use standard setups like hot/warm. Wikimedia has a similar write load but they (likely) have a couple of orders of magnitude more read load than you do. They use 24 very nice servers in one cluster and 31 less nice ones in their other cluster. They have two clusters so they can fail over without downtime. I think it's unlikely you'll need that many. Maybe you'll be able to get away with 3 if you don't have much read load and you can optimize the writes. Maybe you just want to buy 6 so you don't have to be so careful. I dunno, it's hard to tell. Also, it is early so I could be dropping a number somewhere.

Keeping the working set in the disk cache is quite important, and you aren't helping things by updating everything once a day. If RAM isn't super expensive then buy it - it'll always help you with performance. 64GB is the minimum I'd suggest, but I expect you'll end up with more because it is cheapish.

500 million documents a day is fairly substantial to index. Your B dataset is even more so.

Like "which product/day combination had the highest X"? That kind of things is fine so long as you aren't hitting a zillion matches. It is O(max(hits, top_hits) * log(top_hits)).

"All" might be trouble again if it is greater than 10,000. The query itself should be fine.
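Dataset B is a term filter plus a date range (names invented again); if one product/time-range combination can exceed ~10,000 rows, page with scroll or search_after instead of raising size:

    GET product_stats_daily/_search
    {
      "size": 500,
      "query": {
        "bool": {
          "filter": [
            { "term":  { "catalog_id": "1235" } },
            { "term":  { "product_id": "42" } },
            { "range": { "day": { "gte": "2016-01-01", "lte": "2016-03-31" } } }
          ]
        }
      },
      "sort": [ { "day": "asc" } ]
    }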

The write load troubles me. You'll have to experiment with it.


Wow! Thank you for your precise help! ... and for your nice hint regarding properties - a foreign programmer's English is often kinda misleading :slight_smile:

I am investigating how to reduce the write load! Would it help to store a hash, fetch the documents (or just the hash) beforehand, and issue updates only for the documents that actually changed?