Data Duplication Model with Nested Docs

Hello everyone,

We modeled our data lately and we're pretty much happy with the results - meaning that we managed to make the queries we wanted about our data. Our weekly index is about 150GB. Our search performance are somewhat poor (after a full week had past it might get 3-4 seconds for 6 queries sent at once), but we are working to get more resources to our cluster (currently 3 master nodes each with 2 GB RAM and 4 CPU, 3 data nodes each with 16 GB RAM and 6 CPU, 9 shards, 1 replica).

I want to ask about the way we chose to model our data. Let's imagine we want to store a product and the manufacturers which created them. So our data looks something like this: our root document is the product itself, and each product have nested documents represent its manufactures.

Actually it's considered a classic approach of using nested type. Each product document has a small amount of manufactures documents (between 1-30). The product document might be updated, but the manufacture document isn't. So The only thing bothers me here is the fact that each manufacture would be duplicated many times in our index. For example: If LG manufacturer produce 100 products, then 100 nested documents would be created. On the other hand, storage is considerd cheap nowsdays and duplication in BI system is very common.

This duplication is indeed convenient. A classic query in our system would be to count the top 10 products made by specific manufacturer. So we can query about products created in the last week, get all of the related manufactures, filter only manufactures located in UK and the sort the manufactures by the related products. Something like that:

{
	"query" : {
		"range" : {
			"date" : {
				"gte" : "now-7d/d",
				"lt" : "now/d"
			}
		}
	},
	"aggs" : {
		"GetAllRelevantManufacturers" : {
			"nested" : {
				"path" : "Manufacturers"
			},
			"aggs" : {
				"filterOnlyUK" : {
					"filter" : {
						"term" : {
							"Manufacturers.City" : "UK"
						}
					}
				}
			},
			"aggs" : {
				"top10Manufacturers" : {
					"terms" : {
						"field" : "Manufacturers.Company",
						"order" : {
							"backToProducts>productsCount" : "desc"
						}
					},
					"aggs" : {
						"backToProducts" : {
							"reverse_nested" : {},
							"aggs" : {
								"productsCount" : {
									"value_count" : {
										"field" : "productId"
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

What do you think about our modeling? Do you have a better approach?
We could have created different indices for manufacturers and products, but I think it would really make the joins much more complex in the client/server apps. I don't believe it would make the search performance a lot better either.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.