Some background may be needed:
I've got a search engine that needs to return hotels (i.e. 1 hotel is 1
doc). Each hotel has a price & availability for each combination of
<date, duration, nr of persons, roomtype>. (This means that, in the current
implementation, a hotel can have about 20,000 prices depending on the chosen
combo of date, duration, nr of persons and roomtype.)
Requirements:
- Return every hotel only once in the resultset.
- required in query: each query always requests a price for a
particular <date, duration, nr of persons, roomtype> combo. This cannot be
omitted.
- optional in query: filter on hotel-specific info (user rating,
facilities)
- optional in query: filter on 'price' (min/max). Here 'price' is the
price related to the required <date, duration, nr of persons, roomtype> combo
- optional in query: sort on price (asc + desc) and/or sort on
hotel fields like rating.
- output: hotelid + price
- even better would be if I could return a 'payload' that is attached to
the specific <date, duration, nr of persons, roomtype> combo besides the
price.
Currently this is implemented in such a way that each <date, duration, nr
of persons, roomtype> combo has its own dynamic field. However, docs
with 20,000+ fields just aren't something Lucene etc. is really
well-suited for. (I can go into specifics, but that's not really the point
here; one of them is blowing up the Lucene fieldcache while sorting on these
fields, with uncontrollable memory consumption as a result.)
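For illustration, a minimal sketch of what such a dynamic-field layout might look like. The field-naming scheme, the `price_field` helper and the sample values are hypothetical, not the actual schema:

```python
from datetime import date


def price_field(start: date, duration: int, persons: int, roomtype: str) -> str:
    """Build a (hypothetical) dynamic field name for one combo."""
    return f"price_{start:%Y%m%d}_{duration}_{persons}_{roomtype}"


# One hotel document: hotel-level fields plus one price field per combo.
hotel_doc = {
    "hotelid": "h1",
    "rating": 8.2,
    price_field(date(2013, 7, 1), 7, 2, "double"): 540,
    price_field(date(2013, 7, 1), 14, 2, "double"): 980,
    # ... one more field for every other combo the hotel offers
}

# Each query filters and sorts on exactly one of these fields.
wanted = price_field(date(2013, 7, 1), 7, 2, "double")
print(wanted, hotel_doc[wanted])
```

Since every distinct combo becomes a separate field, sorting across many different queries touches many different fields, which is where the fieldcache growth comes from.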
I'm currently investigating some other approaches:
My best bet currently is on modeling prices by (mis)using the pretty
recent spatial additions to Lucene, i.e. model a <<date, duration, nr of
persons, roomtype>, price> combo as a point where point.x = <date, duration,
nr of persons, roomtype> and point.y = price. Relevant discussion (although
in a Solr context):
http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-td4026011.html
With some work on custom scorers everything should work except the last
requirement (returning a payload related to the <date, duration, nr of
persons, roomtype> combo). Or at least that's my current understanding.
(Please correct me if payloads can be returned per point using the ES
spatial stuff.)
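To make the point encoding concrete, here is a rough sketch in plain Python. It does not use any Lucene or ES spatial API; the packing scheme, the bounds (`MAX_DURATION`, `MAX_PERSONS`), the `ROOMTYPES` list and the reference date are all assumptions made up for the example:

```python
from datetime import date

ROOMTYPES = ["single", "double", "suite"]  # hypothetical roomtype vocabulary
MAX_DURATION = 30                          # assumed upper bound on duration
MAX_PERSONS = 10                           # assumed upper bound on persons
EPOCH = date(2013, 1, 1)                   # arbitrary reference date


def combo_to_x(start, duration, persons, roomtype):
    """Pack <date, duration, persons, roomtype> into one integer coordinate."""
    x = (start - EPOCH).days
    x = x * MAX_DURATION + (duration - 1)
    x = x * MAX_PERSONS + (persons - 1)
    x = x * len(ROOMTYPES) + ROOMTYPES.index(roomtype)
    return x


def matching_prices(points, x, min_price=None, max_price=None):
    """Prices of the points lying on the segment x = combo, min <= y <= max."""
    lo = float("-inf") if min_price is None else min_price
    hi = float("inf") if max_price is None else max_price
    return [y for (px, y) in points if px == x and lo <= y <= hi]


# One hotel = a set of (x, price) points.
x1 = combo_to_x(date(2013, 7, 1), 7, 2, "double")
x2 = combo_to_x(date(2013, 7, 1), 14, 2, "double")
hotel_points = [(x1, 540), (x2, 980)]

print(matching_prices(hotel_points, x1, min_price=500, max_price=600))
```

The price filter then corresponds to a degenerate bounding box (a vertical line segment) in the spatial index. Whether a real spatial field can hold coordinates of this size and precision is a separate question that this sketch does not address.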
Reading up on ES, however, perhaps using the nested type would work?
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html)
I envision 1 doc representing 1 hotel with multiple nested docs of format:
{
  key: someTransform(<date, duration, nr of persons, roomtype>),
  price: <price for this combo>,
  payload: "some other stuff besides price to return, related to the matched <date, duration, nr of persons, roomtype>"
}
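As a rough sketch, such a hotel document could be built like this. The `some_transform` function is only a stand-in for someTransform (a simple string key is assumed), and the `prices` field name, the sample prices and the payload strings are invented for the example:

```python
from datetime import date


def some_transform(start, duration, persons, roomtype):
    """Hypothetical stand-in for someTransform: a plain string key."""
    return f"{start.isoformat()}|{duration}|{persons}|{roomtype}"


# One hotel = one document; "prices" would be mapped with type "nested".
hotel_doc = {
    "hotelid": "h1",
    "rating": 8.2,
    "prices": [
        {
            "key": some_transform(date(2013, 7, 1), 7, 2, "double"),
            "price": 540,
            "payload": "breakfast included",
        },
        {
            "key": some_transform(date(2013, 7, 1), 14, 2, "double"),
            "price": 980,
            "payload": "half board",
        },
    ],
}

print(hotel_doc["prices"][0]["key"])
```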
Then, using the 'has_parent' query
(http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html),
I can (please correct me if wrong):
- fetch the child docs for which 'key' matches the
user-supplied <date, duration, nr of persons, roomtype>
- sort on price (of the matching child doc)
- filter on price (of the matching child doc)
- filter on some fields in the parent doc (the hotel) such as hotel rating
- return the matching child doc
However, I also want to be able to:
A. return the hotelid (or in other words the id of the parent-document)
B. sort on hotel-fields like hotel-rating.
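To pin down the behaviour I'm after, here is a plain-Python model of the desired semantics over in-memory data. It is not ES query DSL and says nothing about whether ES supports this; the `search` function, its parameters and the sample hotels are all made up for illustration:

```python
def search(hotels, key, min_price=None, max_price=None, min_rating=None,
           sort_by="price", descending=False):
    """Return at most one hit per hotel for the requested combo key."""
    hits = []
    for hotel in hotels:
        # Optional filter on a parent (hotel) field.
        if min_rating is not None and hotel["rating"] < min_rating:
            continue
        # Required: the child entry whose key matches the requested combo.
        entry = next((p for p in hotel["prices"] if p["key"] == key), None)
        if entry is None:
            continue
        # Optional filter on the price of the matching child.
        if min_price is not None and entry["price"] < min_price:
            continue
        if max_price is not None and entry["price"] > max_price:
            continue
        hits.append({
            "hotelid": hotel["hotelid"],   # A: id of the parent document
            "rating": hotel["rating"],
            "price": entry["price"],
            "payload": entry["payload"],
        })
    # B: sort on either the child price or a parent field such as rating.
    hits.sort(key=lambda h: h[sort_by], reverse=descending)
    return hits


hotels = [
    {"hotelid": "h1", "rating": 8.2, "prices": [
        {"key": "k1", "price": 540, "payload": "a"},
        {"key": "k2", "price": 980, "payload": "b"}]},
    {"hotelid": "h2", "rating": 9.0, "prices": [
        {"key": "k1", "price": 610, "payload": "c"}]},
    {"hotelid": "h3", "rating": 6.5, "prices": [
        {"key": "k2", "price": 300, "payload": "d"}]},
]

for hit in search(hotels, "k1", sort_by="rating", descending=True):
    print(hit["hotelid"], hit["price"], hit["payload"])
```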
The questions:
- are A. and B. supported in the proposed solution above?
- would the idea outlined above work, and are there any caveats I should be
aware of?
- is there some other (better) way to model this?
Thanks,
Geert-Jan