Is has_child (parent/child) the best strategy for my use-case involving products and inventory?


(Ben Hirsch) #1

We are considering using Elasticsearch for an upcoming project. I have
done quite a bit of research on the API and am curious about a few things
as they relate to a specific but essential use-case for the project.
If we cannot satisfy this use-case, I do not think ES will be right for us.

My research has led me to a parent/child relationship and the
has_child query (I also looked at 'nested' and 'inner object' mappings). But I
am not sure this is the best approach, as I am still wrapping my head
around how best to model and denormalize my data for Elasticsearch.
We currently have a relational database in place and are planning to
set up Elasticsearch to run alongside this DB as our search
repository.

The use-case is as follows:

  • We are storing product information (300,000+ products) as type 'product'.
  • We are also storing inventory data for 20,000+ retailers.
  • Each product has a set of UPCs and each retailer has a list of UPCs they
    carry along with the quantities in stock.
  • The 'product_retailers' type stores ALL of the retailers who carry the
    parent product. This will be re-indexed very often (at least once an hour
    for each product)
  • The document model I am proposing we use:

$ curl -XPUT 'http://localhost:9200/products/product/1' -d '
{
  "name" : "Foo",
  "description" : "Bar...",
  ...
}'

$ curl -XPUT 'http://localhost:9200/products/product_retailers/_mapping' -d '
{
  "product_retailers" : {
    "_parent" : {
      "type" : "product"
    }
  }
}'

# One child document per retailer; "id" is the retailer id.
$ curl -XPOST 'http://localhost:9200/products/product_retailers/?parent=1' -d '
{
  "id" : 888,
  "upcs" : [
    { "code" : 123456789012, "quantity" : 22 },
    { "code" : 123456789013, "quantity" : 19 },
    { "code" : 123456789014, "quantity" : 27 },
    ...
  ]
}'

$ curl -XPOST 'http://localhost:9200/products/product_retailers/?parent=1' -d '
{
  "id" : 889,
  "upcs" : [
    { "code" : 123456789012, "quantity" : 11 },
    { "code" : 123456789013, "quantity" : 2 },
    { "code" : 123456789014, "quantity" : 1 },
    ...
  ]
}'

  • We need to filter product results (based on keyword matches)
    against a set of retailer IDs, keeping only products those retailers
    have in stock.
  • Put another way: given a list of retailer IDs and a search
    keyphrase, we need to return the matching products.
  • A huge bonus would be to ALSO include the data about the matching
    retailers in the result set.

Is this even possible with ES? Am I modeling my data correctly
so that it can scale to the quantity of items we are storing?
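
For reference, the search described above (keyphrase plus a list of retailer ids) could be expressed with has_child roughly as follows. This is only a sketch under assumptions: it supposes one product_retailers child document per retailer with the retailer id in a field called "id", and keyword matching on "name" and "description". The helper name build_has_child_search is made up for illustration, the code only builds the request body, and the exact query syntax varies between Elasticsearch versions.

```python
import json


def build_has_child_search(keyphrase, retailer_ids):
    """Build a search body for products matching `keyphrase` that have at
    least one product_retailers child belonging to one of `retailer_ids`."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Keyword match against the parent product documents.
                    {
                        "multi_match": {
                            "query": keyphrase,
                            "fields": ["name", "description"],
                        }
                    },
                    # Keep only products with a matching child document.
                    {
                        "has_child": {
                            "type": "product_retailers",
                            "query": {"terms": {"id": list(retailer_ids)}},
                        }
                    },
                ]
            }
        }
    }


body = build_has_child_search("foo", [888, 889])
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a _search request against the products index. Note that this returns the parent products only; it does not by itself return the matching retailer data.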

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bd2d9424-af9b-442f-9a30-919a250133f9%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Michael Sokolov) #2

Ben, this is a difficult use case for the Lucene index on which ES is
built, because in essence you have two primary objects of interest, and a
relationship between them.

The parent/child relation is useful if you have a (document) tree, so I
think you could use it to relate products and UPCs. But in this instance you
also have a many-to-many relation from retailer to product (and UPC), and
parent/child won't help you there.

We have a similar problem in my company with access control to documents.
Our customers acquire access to documents: we have a lot of documents and a
lot of customers (although many fewer than documents), and we want
customers to be able to search only for products they have access to.
We're only filtering on a single customer, where you want to use a set, but
the principle is similar.

The usual approach to denormalizing, and what we have been doing, is to
build queries using all the product ids the customer has access to. But
this only scales up to about 1000 products per customer, since beyond that
the queries carry too many terms and slow down. The only reason we
were able to do this up until now is that many of our sites have a small
number of huge products that have lots of documents in them. But now we
are getting sites with a lot of medium-sized products, and we are having
queries blow up with too many terms.
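
To make the term-count problem concrete, here is a rough sketch of that query shape. The field names "text" and "product_id" and the helper build_access_query are placeholders, not our real schema; the code only builds request bodies and shows how the request grows with the number of accessible products.

```python
import json


def build_access_query(keyphrase, product_ids):
    """Keyword query restricted to documents in the given products."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": keyphrase}},
                    # One term per accessible product: this list is what grows.
                    {"terms": {"product_id": list(product_ids)}},
                ]
            }
        }
    }


small = build_access_query("contract", range(10))
large = build_access_query("contract", range(5000))
for name, body in (("small", small), ("large", large)):
    n_terms = len(body["query"]["bool"]["must"][1]["terms"]["product_id"])
    print(name, "terms:", n_terms, "request bytes:", len(json.dumps(body)))
```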

Then we thought: well, we'll denormalize by indexing the customer access
relation in with the document, basically tagging every product document with
the customers that have it. This works great for search since you only need
a single query term per customer, but places a huge burden on the indexer,
which becomes complex and has to do a lot more updates. Especially if you
are going to index every hour, there will presumably be a lot of change,
although possibly incremental? We haven't actually tried this in
production, but I have been doing some calculations and I think the
indexing cost will be prohibitive. It is hard to give a definite answer
because it depends heavily on the distribution of the data. But this
is the most natural answer for a search index like Lucene/ES.
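
A small sketch of the trade-off in the tagging approach, again with placeholder names ("customer_ids", "product_id", "text", build_tagged_query, grant_access): the search needs a single term per customer, but granting access means touching every document of the product, and each touched document has to be reindexed.

```python
def build_tagged_query(keyphrase, customer_id):
    """Search side: one term for the customer, however many products they own."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": keyphrase}},
                    {"term": {"customer_ids": customer_id}},
                ]
            }
        }
    }


def grant_access(docs, product_id, customer_id):
    """Index side: tag every document of the product with the customer.

    Returns the ids of the documents that changed and need reindexing.
    """
    touched = []
    for doc in docs:
        if doc["product_id"] == product_id and customer_id not in doc["customer_ids"]:
            doc["customer_ids"].append(customer_id)
            touched.append(doc["id"])
    return touched


docs = [
    {"id": 1, "product_id": "A", "customer_ids": []},
    {"id": 2, "product_id": "A", "customer_ids": [7]},
    {"id": 3, "product_id": "B", "customer_ids": []},
]
to_reindex = grant_access(docs, "A", 7)
print("documents to reindex:", to_reindex)
```

With a product holding many documents, a single access change fans out into that many index updates, which is where the indexing cost comes from.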

Currently I'm experimenting with generating product groups automatically.
I believe our customers tend to buy the same groups of products, and if
that's true, we can index those groups in the products, and record the
relation of customer->group, but this is kind of complicated and not yet
working in a form that I can share, sorry.

If you get your queries working OK, I think you will be able to retrieve
customer ids, but the devil is in the details, of course.

-Mike

(In reply to Ben Hirsch's post of Friday, February 14, 2014 10:01:22 PM UTC-5.)


(Ben Hirsch) #3

Argh, I just wrote a reply but Google ate it, apparently.

So, are you suggesting that scrapping parent/child and simply storing all
of the retailer data in the product document is a safer bet? I imagine we
could rate limit our product indexing. However, this now gives me two
concerns: (1) the size of the product document, since we would have product
documents with 20,000+ UPC entries as nested objects, and (2) would our
search results also return those 20,000 nested objects?



(Michael Sokolov) #4

The problem with the parent/child thing is that each child can only have a
single parent. So if you index what -- in your model -- is the same
product as a child of multiple vendors, and then you want to search for
multiple vendors at once, you are going to get multiple copies of the child
product, one for each vendor that has it. I think probably if you were
always searching for a single vendor at a time, this could work though.

Indexing all the vendor ids in each product document probably won't add any
indexing overhead relative to the parent/child model you proposed -- in
both cases you have to reindex the products when a vendor picks them up or
drops them, although in the parent/child case ES does this for you
internally.

Re: the huge documents; yes that could be a problem. Although having lots
of terms in the index is not really an issue, the fact that ES stores your
entire document and gives it back to you by default will chew up storage
and bandwidth, as you say. To avoid this, you could disable the _source
field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html
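
A minimal sketch of what that could look like, assuming the 'product' type from the original post. The helper names are made up and the code only builds the JSON bodies. The first function produces a mapping with _source disabled; the second shows an alternative that only applies when _source is left enabled, namely asking the search request to return just a few fields. Keep in mind that with _source disabled ES cannot hand back the original JSON at all, and features that rely on it (such as partial updates) stop working.

```python
import json


def build_product_mapping(store_source=False):
    """Mapping for the 'product' type with _source switched on or off."""
    return {"product": {"_source": {"enabled": store_source}}}


def limit_returned_source(search_body, fields):
    """Return a copy of a search body that asks for only some _source fields."""
    limited = dict(search_body)
    limited["_source"] = list(fields)
    return limited


mapping = build_product_mapping()
search = limit_returned_source({"query": {"match": {"name": "foo"}}}, ["name"])
print(json.dumps(mapping))
print(json.dumps(search))
```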

-Mike

PS I got your first reply - it just didn't go to the list

(In reply to Ben Hirsch's post of Friday, February 14, 2014 11:34:11 PM UTC-5.)


(Ben Hirsch) #5

Thank you for your suggestions, Michael. I am going to start testing with
a nested object for my retailer data, and also with disabling
_source storage of the original JSON. Will post my results.
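
A rough sketch of the nested variant to be tested. Everything here is an assumption rather than a settled design: the retailer entries live in a nested field called "retailers" on the product document, each with an "id" and a list of "upcs", and the helper names are invented for illustration. The code only builds the mapping and query bodies; exact syntax depends on the Elasticsearch version.

```python
import json


def build_nested_mapping():
    """Product mapping with retailers stored as nested objects."""
    return {
        "product": {
            "properties": {
                "retailers": {
                    "type": "nested",
                    "properties": {
                        "id": {"type": "long"},
                        "upcs": {
                            "properties": {
                                "code": {"type": "long"},
                                "quantity": {"type": "integer"},
                            }
                        },
                    },
                }
            }
        }
    }


def build_nested_search(keyphrase, retailer_ids):
    """Products matching `keyphrase` where one of `retailer_ids` has stock."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": keyphrase,
                            "fields": ["name", "description"],
                        }
                    },
                    {
                        "nested": {
                            "path": "retailers",
                            # Both conditions must hold for the SAME retailer.
                            "query": {
                                "bool": {
                                    "must": [
                                        {"terms": {"retailers.id": list(retailer_ids)}},
                                        {"range": {"retailers.upcs.quantity": {"gt": 0}}},
                                    ]
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


nested_mapping = build_nested_mapping()
nested_search = build_nested_search("foo", [888, 889])
print(json.dumps(nested_search, indent=2))
```

The nested type is what keeps the retailer id and its quantities tied together; with plain inner objects the id of one retailer could match alongside the stock of another.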

(In reply to Michael Sokolov's post of Saturday, February 15, 2014 9:35:10 AM UTC-5.)

