Indexing documents with labels and weights


(Dustin Boswell) #1

I have a weighted set of labels for each document (e.g. "TECHNOLOGY=0.5",
"FUNNY=0.1", ...). Each weight indicates how strongly the label applies to
the document. Each document has a different set of labels, with different
weights. There are thousands of labels in total, although each document
might only have a few dozen with non-zero weight.

I'd like those "terms" (e.g. "TECHNOLOGY") to be indexed for that document
with that weight, so that a query for "TECHNOLOGY" would score the document
according to that label's weight. Or more generally, the query is a vector
of weights, and I'd like to score each document as the vector dot product
against the labels.
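
To make the desired scoring concrete, here is a minimal sketch in plain Python
of the dot product described above. It does not touch Elasticsearch at all; the
function name and the sample query weights are made up for illustration.

```python
def dot_product_score(query_weights, doc_labels):
    """Sum of query_weight * document_weight over the labels in the query.

    Labels missing from the document contribute 0 to the score.
    """
    return sum(
        q_weight * doc_labels.get(label, 0.0)
        for label, q_weight in query_weights.items()
    )


# Document labels taken from the example above.
doc_labels = {"TECHNOLOGY": 0.5, "FUNNY": 0.1}

# Hypothetical query vector: FUNNY matters twice as much as TECHNOLOGY.
query_weights = {"TECHNOLOGY": 1.0, "FUNNY": 2.0, "SPORTS": 3.0}

score = dot_product_score(query_weights, doc_labels)
print(score)
```

A single-label query such as {"TECHNOLOGY": 1.0} reduces to the simple case of
scoring the document by that label's weight.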

I'm aware of the "boost" feature at indexing time, although this doesn't
seem to allow per-document per-field weights. I'm also not sure how I
would index each label in this case (is the label name the field name, and
the weight the value?).

I suppose I could write a custom_score_query of some sort but I fear that
the query would get very big and the search would be very slow.

One hacky idea I had is to round each weight to the nearest 0.1, so there
are only 10 possible weights: 0.1, 0.2, 0.3, ..., 1.0. Each document would
then have up to 10 fields like:
    labels_weighted_0_1: ["FUNNY"]
    labels_weighted_0_5: ["TECHNOLOGY"]

And then the query would search across all 10 of these fields, where the
labels_weighted_0_1 field would always have a 0.1 boost,
labels_weighted_0_2 would always have a 0.2 boost, etc...
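
As a rough sketch of the bucketing idea, the Python below rounds each weight to
the nearest tenth, groups labels into the labels_weighted_* fields, and builds
a query body as a plain dict. The helper names are hypothetical. Wrapping each
term in "constant_score" so that only the boost contributes to the score, and
dropping weights that round to 0, are my assumptions and not something
confirmed for any particular Elasticsearch version.

```python
def field_name(tenths):
    """Field for a weight of tenths/10, e.g. 5 -> 'labels_weighted_0_5'."""
    return "labels_weighted_%d_%d" % (tenths // 10, tenths % 10)


def bucket_fields(doc_labels):
    """Group a document's labels into the 10 weight-bucket fields."""
    fields = {}
    for label, weight in doc_labels.items():
        # Round half up to the nearest 0.1 and cap at 1.0.
        tenths = min(10, int(weight * 10 + 0.5))
        if tenths < 1:
            continue  # Assumption: weights that round to 0 are not indexed.
        fields.setdefault(field_name(tenths), []).append(label)
    return fields


def build_query(query_weights):
    """One clause per (bucket, label) pair, boosted by query weight * bucket weight."""
    should = []
    for tenths in range(1, 11):
        for label, q_weight in query_weights.items():
            should.append({
                "constant_score": {
                    "filter": {"term": {field_name(tenths): label}},
                    "boost": q_weight * tenths / 10.0,
                }
            })
    return {"query": {"bool": {"should": should}}}


doc_fields = bucket_fields({"TECHNOLOGY": 0.5, "FUNNY": 0.1})
query_body = build_query({"TECHNOLOGY": 1.0})
```

Note that the query grows to 10 clauses per query label, which is the size
concern mentioned above.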

But I'm wondering if there is a better/simpler way. Thanks for any help or
ideas.


