Best approach for modeling multiple prices per product. Would nested docs work?

I feel some background may be needed:

I've got a search engine that needs to return hotels (i.e. 1 hotel is 1
doc). Each hotel has price & availability based on the following combo:
<date,duration, nr of persons, roomtype>. (This means a hotel can have
about 20,000 prices, depending on the chosen combo of date, duration, nr of
persons and roomtype in the current implementation.)

Requirements:

  1. Return every hotel only once in the result set.
  2. required in query: each query always requests a price for a
    particular <date,duration, nr of persons, roomtype>-combo. This cannot be
    omitted.
  3. optional in query: filter on hotel specific info (user-rating,
    facilities)
  4. optional in query: filter on 'price' (min/max). Here 'price' is the
    price related to the required <date,duration, nr of persons, roomtype>-combo
  5. optional in query: sort on price (asc + desc) and/or sort on
    hotel-fields like rating.
  6. output: hotelid + price
  7. even better would be if I could return a 'payload' that is attached to
    the specific <date,duration, nr of persons, roomtype>-combo besides the
    price.

Currently this is implemented in such a way that each <date,duration, nr
of persons, roomtype>-combo has its own dynamic field. However, docs
with 20,000+ fields just aren't something Lucene etc. is really
well-suited for. (I can go into specifics, but that's not really the point
here; blowing up the Lucene fieldcache while sorting on these fields, with
uncontrollable memory consumption as a result, is one of the problems.)

I'm currently investigating some other approaches:

My best bet currently is on modeling prices by (mis)using the pretty
recent spatial additions to Lucene, i.e. model a <<date,duration, nr of
persons, roomtype>,price> combo as a point where point.x = <date,duration,
nr of persons, roomtype> and point.y = price. Relevant discussion (although
in a Solr context):
http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-td4026011.html
With some work on custom scorers everything should work except 7
(returning a payload related to <date,duration, nr of persons, roomtype>).
Or at least that's my current understanding. (Please correct me if payloads
can be returned per point using the ES spatial stuff.)
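
The geometric idea can be illustrated without any spatial library. The sketch below is plain Python, not the Lucene or ES spatial API: it packs each combo into one integer x (the `combo_to_x` encoding, its value ranges and the room type list are made up for illustration) and treats a price filter as a rectangle with x fixed and y between min and max.

```python
from datetime import date

EPOCH = date(2012, 1, 1)          # assumed reference date for the encoding
ROOMTYPES = ["single", "double"]  # assumed, fixed list of room types


def combo_to_x(start: date, duration: int, persons: int, roomtype: str) -> int:
    """Pack a combo into one integer (mixed radix). Assumed ranges:
    duration 0..99, persons 0..9, room type index 0..9."""
    days = (start - EPOCH).days
    return ((days * 100 + duration) * 10 + persons) * 10 + ROOMTYPES.index(roomtype)


def matching_price(points, x, min_price=None, max_price=None):
    """Return the price (y) of the point at x that lies inside the rectangle
    [x, x] x [min_price, max_price], or None if no point falls in it."""
    for px, py in points:
        if px != x:
            continue
        if min_price is not None and py < min_price:
            continue
        if max_price is not None and py > max_price:
            continue
        return py
    return None


# One hotel = one multi-point field: a list of (x, price) points.
hotel_points = [
    (combo_to_x(date(2013, 1, 5), 7, 2, "double"), 499.0),
    (combo_to_x(date(2013, 1, 5), 7, 2, "single"), 420.0),
]
wanted = combo_to_x(date(2013, 1, 5), 7, 2, "double")
print(matching_price(hotel_points, wanted, max_price=600))
print(matching_price(hotel_points, wanted, max_price=400))
```

A point carries only x and y, which is why requirement 7 (a payload per combo) does not fit this model without extra work.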

Reading up on ES, however, perhaps using the nested type would work?
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html)
I envision 1 doc representing 1 hotel with multiple nested docs of the format:
{
  key: someTransform(<date,duration, nr of persons, roomtype>),
  price: <price for this combo>,
  payload: "some other stuff besides price to return related to the
    matched <date,duration, nr of persons, roomtype>"
}
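
To make the proposed document shape concrete, here is a minimal sketch in Python. The field names (`hotelid`, `rating`, `prices`) and the `make_key` transform are assumptions for illustration; the post only says that some transform of the combo serves as the key.

```python
import json
from datetime import date


def make_key(start: date, duration: int, persons: int, roomtype: str) -> str:
    """Hypothetical stand-in for someTransform(): one string per combo."""
    return f"{start.isoformat()}|{duration}|{persons}|{roomtype}"


def make_hotel_doc(hotelid: str, rating: float, offers: list) -> dict:
    """Build one hotel doc; each offer becomes one nested price entry."""
    return {
        "hotelid": hotelid,
        "rating": rating,
        "prices": [
            {
                "key": make_key(o["start"], o["duration"], o["persons"], o["roomtype"]),
                "price": o["price"],
                "payload": o.get("payload", ""),
            }
            for o in offers
        ],
    }


doc = make_hotel_doc(
    "hotel-1",
    4.5,
    [{"start": date(2013, 1, 5), "duration": 7, "persons": 2,
      "roomtype": "double", "price": 499.0, "payload": "breakfast included"}],
)
print(json.dumps(doc, indent=2))
```

With a single `key` field per nested entry, the required combo in every query becomes one exact-match lookup instead of one of 20,000 dynamic fields.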

Then using the 'has_parent' query
(http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html)
I can (please correct if wrong)

  • fetch the childdocs for which 'key' matches the
    user-supplied: <date,duration, nr of persons, roomtype>
  • sort on price (of the matching childdoc)
  • filter on price (of the matching childdoc)
  • filter on some fields in the parent-doc (the hotel) such as hotel-rating
  • return the matching child-doc

However, I also want to be able to:
A. return the hotelid (or in other words the id of the parent-document)
B. sort on hotel-fields like hotel-rating.

The question:

  • are A. and B. supported in the proposed solution above?
  • would the idea outlined above work, any caveats I should be aware of?
  • some other (better) way to model this?

Thanks,
Geert-Jan

--

Hello Geert-Jan,

There's a difference between parent and child documents[0] and nested
documents[1]. You can even have both if you need to, although it might not
help your use case.

Under the hood, both are stored as separate documents, but they look
different from the searching application's point of view. A document with
nested docs is treated as a single doc (which makes it easier to search),
while parent and child documents are treated separately (which makes them
easier to index/update). I'm not sure which one's best for you.

I'll try to address your points inline.

On Wed, Dec 12, 2012 at 6:39 PM, Geert-Jan Brits gbrits@gmail.com wrote:

Then using the 'has_parent' query
(http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html)
I can (please correct if wrong)

  • fetch the childdocs for which 'key' matches the
    user-supplied: <date,duration, nr of persons, roomtype>
  • sort on price (of the matching childdoc)
  • filter on price (of the matching childdoc)
  • filter on some fields in the parent-doc (the hotel) such as hotel-rating
  • return the matching child-doc

Right. You can do that with either parent-child or nested docs.
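
For the nested variant, the list above could translate into a request body along these lines. This is a sketch only: the field names (`prices`, `prices.key`, `prices.price`, `rating`) are assumptions, and the `nested` sort option and `inner_hits` follow the syntax of recent Elasticsearch releases, so older versions will need different syntax.

```python
import json


def nested_price_search(key, min_price=None, max_price=None, min_rating=None,
                        order="asc"):
    """Build a search body: match the nested price by key, optionally filter
    on price and hotel rating, and sort on the matching nested price."""
    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price

    nested_filters = [{"term": {"prices.key": key}}]
    if price_range:
        nested_filters.append({"range": {"prices.price": price_range}})

    filters = [{
        "nested": {
            "path": "prices",
            "query": {"bool": {"filter": nested_filters}},
            # inner_hits asks for the matching nested doc (price + payload)
            "inner_hits": {},
        }
    }]
    if min_rating is not None:
        filters.append({"range": {"rating": {"gte": min_rating}}})

    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{
            "prices.price": {
                "order": order,
                # only the nested doc for the requested combo drives the sort
                "nested": {"path": "prices",
                           "filter": {"term": {"prices.key": key}}},
            }
        }],
    }


body = nested_price_search("2013-01-05|7|2|double", max_price=600, min_rating=4)
print(json.dumps(body, indent=2))
```

Each hit of a nested query is the hotel document itself, so every hotel appears at most once and the hit's ID identifies the hotel.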

However, I also want to be able to:
A. return the hotelid (or in other words the id of the parent-document)

In nested documents you get your whole doc, with all sub-documents.

If you search child documents, you have the ID of the parent in the
"_parent" field of each child doc.

B. sort on hotel-fields like hotel-rating.

Again, with nested docs it's easy. With parent-child, you have to run your
search against the hotel (parent) docs to get results sorted by fields in
there. If you want to add pricing (children) criteria, you can add a
has_child query to that search.
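
A parent-side search of that kind might look like the sketch below. The child type name `price` and the field names are assumptions, and the mapping needed to set up the parent-child relation depends on the Elasticsearch version.

```python
import json


def hotels_with_price(key, max_price=None, sort_field="rating", order="desc"):
    """Search hotel (parent) docs, keep those that have a matching price
    child, and sort on a hotel field."""
    child_filters = [{"term": {"key": key}}]
    if max_price is not None:
        child_filters.append({"range": {"price": {"lte": max_price}}})

    return {
        "query": {
            "has_child": {
                "type": "price",  # assumed child type name
                "query": {"bool": {"filter": child_filters}},
            }
        },
        # sort fields here refer to the parent (hotel) docs
        "sort": [{sort_field: {"order": order}}],
    }


body = hotels_with_price("2013-01-05|7|2|double", max_price=600)
print(json.dumps(body, indent=2))
```

This covers sorting on hotel fields; sorting the hotels on the child's price needs a different approach, which this sketch does not attempt.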

The question:

  • are A. and B. supported in the proposed solution above?

I think they are, yes.

  • would the idea outlined above work, any caveats I should be aware of?

has_child filters and queries are run first, and then the "parent"
query/filter. IDs are loaded into memory, and since you say you have lots
of children in general, I would assume such queries will be
memory-intensive.

On the other hand, updating (which implies reindexing) a hotel doc with
20,000 nested price docs would also be resource-intensive.

  • some other (better) way to model this?

You might want to do at least some of the combining of rooms/persons/room
type at query time using script fields. But that would make your queries
slower, of course.

[0] Elasticsearch guide: parent and child documents
[1] Elasticsearch guide: nested type

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Radu,

I didn't get alerted about your post (pretty annoying Google Groups stuff,
if you ask me).

Thanks for confirming the stuff should work. I'll check your suggestion
about script-fields, really helpful.

Geert-Jan

--

Radu: Did you end up implementing this? I am curious how it worked, since I
am thinking of doing something similar.

Thanks,
Nicolas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Would be very interested to find out if any of you guys have managed to
implement this successfully. Struggling with a similar issue at the moment.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6f6da93c-9639-4d58-9e9e-4ebb70f7883e%40googlegroups.com.