Hi Matei
I have about 50M events, each described by tags. There are 4 tag
types (places, speakers, topics, industries).
- Tags are hierarchical. For example, "Eric Schmidt" is filed under
Google, which is filed under Tech companies. So, whenever Eric is at an
event, all three tags are associated with the event.
- Different tags can have different popularity, meaning "Eric
Schmidt" would have a popularity of 100, but "Eileen Naughton" would
have a popularity of 10.
- The popularity does not apply hierarchically. That means that if
"Eric Schmidt" left Google for Foursquare, his popularity would
still be 100 and Foursquare would still have a popularity of 50.
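To make the hierarchy rule concrete, here is a minimal Python sketch of how a tag could be expanded into itself plus everything it is filed under before indexing. The `PARENT` table and the `with_ancestors` helper are hypothetical names for illustration, and the sketch assumes the hierarchy has no cycles.

```python
# Hypothetical parent links: each tag points at the tag it is filed under.
PARENT = {
    "Eric Schmidt": "Google",
    "Google": "Tech companies",
}


def with_ancestors(tag, parent=PARENT):
    """Return the tag followed by every tag above it in the hierarchy."""
    chain = [tag]
    # Walk upwards until we reach a tag that has no parent.
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


print(with_ancestors("Eric Schmidt"))
# ['Eric Schmidt', 'Google', 'Tech companies']
```

Popularity is deliberately not part of this lookup, since it belongs to each tag on its own and does not depend on the parent.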
Now, imagine a left-hand menu with 4 sections:
Places
    Paris
    London
    New York
    [more]
Speakers
    Google
    Facebook
    Mark Zuckerberg
    [more]
and so on.
Whenever the user clicks on a tag, I want the menu to reflect the
results (faceted search). The twist is that when deciding whether to
show "Google", "Eric Schmidt" or "Foursquare" among the first three
tags in each section, I want the most popular tag to be shown
higher, based on [number of matching events] * [tag popularity].
That means that if there are 3 matching events for "Foursquare" and
only one for "Eric Schmidt", it should show Foursquare first, with a
score of 3 * 50 = 150 vs Schmidt's 1 * 100 = 100.
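The scoring rule above can be sketched in a few lines of Python. This only models the arithmetic on in-memory lists, not how a search engine would compute it; `rank_tags` and the sample events are hypothetical, and the popularity values are the ones from the example.

```python
from collections import Counter


def rank_tags(matching_events, popularity, top=3):
    """Rank tags by (number of matching events) * (tag popularity)."""
    # Count each tag at most once per event.
    counts = Counter(tag for event in matching_events for tag in set(event))
    scores = {tag: n * popularity[tag] for tag, n in counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


popularity = {"Foursquare": 50, "Eric Schmidt": 100}
# Three matching events mention Foursquare, one of them also has Eric Schmidt.
events = [
    ["Foursquare"],
    ["Foursquare"],
    ["Foursquare", "Eric Schmidt"],
]

print(rank_tags(events, popularity))
# [('Foursquare', 150), ('Eric Schmidt', 100)]
```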
Also, ideally, if I select "Google" then, for the "speakers" section,
the system should not return people outside Google, even if the
matching events also have "Zuckerberg" listed, with a huge popularity
of 200. So, the returned tags should reside "beneath" the current
selection in each section, and their sorting should be based on the
above scoring logic.
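As an illustration of the "beneath the current selection" rule, the Python sketch below drops every scored tag that is not a descendant of the selected tag and then sorts what is left. The `PARENT` table is hypothetical (including the assumption that "Eileen Naughton" is filed under Google), and the scores stand for event count times popularity with one matching event each.

```python
# Hypothetical parent links for the speakers section.
PARENT = {
    "Eric Schmidt": "Google",
    "Eileen Naughton": "Google",
    "Mark Zuckerberg": "Facebook",
    "Google": "Tech companies",
    "Facebook": "Tech companies",
}


def is_beneath(tag, selection, parent=PARENT):
    """True if `selection` is a strict ancestor of `tag`."""
    node = parent.get(tag)
    while node is not None:
        if node == selection:
            return True
        node = parent.get(node)
    return False


def candidates_beneath(scores, selection):
    """Keep only scored tags under the selection, highest score first."""
    kept = {t: s for t, s in scores.items() if is_beneath(t, selection)}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)


scores = {"Mark Zuckerberg": 200, "Eric Schmidt": 100, "Eileen Naughton": 10}
print(candidates_beneath(scores, "Google"))
# [('Eric Schmidt', 100), ('Eileen Naughton', 10)]
```

Zuckerberg has the highest score but is filtered out because he is not filed under Google.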
I hope I managed to explain what I'm trying to achieve.
OK - that makes a lot of sense now. Much easier to understand with
"real" data.
I think this will be quite easy to do using version 0.90.0.RC1 (which
will be released as stable in the near future), which has support for
sorting on values in nested documents (min,max,sum,avg).
So, index your tags as nested documents:
curl -XPUT 'http://127.0.0.1:9200/events/?pretty=1' -d '
{
   "mappings" : {
      "event" : {
         "properties" : {
            "title" : {
               "type" : "string"
            },
            "tags" : {
               "type" : "nested",
               "properties" : {
                  "value" : {
                     "index" : "not_analyzed",
                     "type" : "string"
                  },
                  "weight" : {
                     "type" : "integer"
                  },
                  "type" : {
                     "index" : "not_analyzed",
                     "type" : "string"
                  }
               }
            }
         }
      }
   }
}
'
Then, some data, eg:
curl -XPOST 'http://127.0.0.1:9200/events/event?pretty=1' -d '
{
   "title" : "Paris in the springtime",
   "tags" : [
      {
         "value" : "Paris",
         "weight" : 10,
         "type" : "place"
      },
      {
         "value" : "Eric Schmidt",
         "weight" : 100,
         "type" : "speaker"
      },
      {
         "value" : "Google",
         "weight" : 50,
         "type" : "company"
      }
   ]
}
'
curl -XPOST 'http://127.0.0.1:9200/events/event?pretty=1' -d '
{
   "title" : "Barcelona is the bollocks",
   "tags" : [
      {
         "value" : "Barcelona",
         "weight" : 30,
         "type" : "place"
      },
      {
         "value" : "Mark Zuckerberg",
         "weight" : 30,
         "type" : "speaker"
      },
      {
         "value" : "Facebook",
         "weight" : 40,
         "type" : "company"
      }
   ]
}
'
Now we can do our search, and sort on the sum of the weights in the
nested docs:
Look for the 'sort' value:
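A minimal sketch of what such a request body might look like, written as a Python dict that can be serialised and passed to `curl -d` against `/events/event/_search`. It assumes the `mode` option of nested sorting mentioned above; the `expected_sort` dict is only a local stand-in for what a "sum" sort would compute from the two sample events, not actual Elasticsearch output.

```python
import json

# Request body: sort events by the sum of tags.weight over their nested tags.
body = {
    "query": {"match_all": {}},
    "sort": [{"tags.weight": {"order": "desc", "mode": "sum"}}],
}

# Local stand-in for the per-event sum (weights copied from the sample data).
events = {
    "Paris in the springtime": [10, 100, 50],
    "Barcelona is the bollocks": [30, 30, 40],
}
expected_sort = {title: sum(weights) for title, weights in events.items()}

print(json.dumps(body, indent=2))
print(expected_sort)
# {'Paris in the springtime': 160, 'Barcelona is the bollocks': 100}
```

Under these assumptions, each hit in the response would carry a `sort` array holding the summed weight, so the Paris event would come first.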
You can filter on (eg) just speakers from Google:
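One possible reading of this, sketched as a Python dict: restrict the hits to events that carry the "Google" tag, and let only the speaker tags contribute to the sort value. The combination of a `filtered` query, a `nested` filter and the sort's `nested_filter` option is an assumption about how the request was built, not a copy of the original.

```python
import json

body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            # Only events that have a nested tag with value "Google".
            "filter": {
                "nested": {
                    "path": "tags",
                    "filter": {"term": {"tags.value": "Google"}},
                }
            },
        }
    },
    "sort": [
        {
            "tags.weight": {
                "order": "desc",
                "mode": "sum",
                # Only speaker tags are summed into the sort value.
                "nested_filter": {"term": {"tags.type": "speaker"}},
            }
        }
    ],
}

print(json.dumps(body, indent=2))
```

With the sample data, only the Paris event would match, and its sort value would come from the "Eric Schmidt" tag alone.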
Or you can include all events, but score just on companies named
'Google':
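A sketch of that variant as a Python dict, assuming the sort's `nested_filter` accepts a `bool` filter: the query matches every event, while only nested tags that are both of type "company" and named "Google" feed the sort value. How events without such a tag are ordered is not shown here.

```python
import json

body = {
    # No restriction on which events are returned.
    "query": {"match_all": {}},
    "sort": [
        {
            "tags.weight": {
                "order": "desc",
                "mode": "sum",
                # Only company tags named "Google" contribute to the sort value.
                "nested_filter": {
                    "bool": {
                        "must": [
                            {"term": {"tags.type": "company"}},
                            {"term": {"tags.value": "Google"}},
                        ]
                    }
                },
            }
        }
    ],
}

print(json.dumps(body, indent=2))
```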
Because we're sorting on a field value (ie 'tags.weight'), its values
need to be loaded into memory. But each nested doc has only a single
value for that field, so you shouldn't run into the memory problems
that you might have had with other designs.
clint