For about 10,000 candidate words, I want to calculate their similarities with a user-given query word, using the following strategy:
- Calculate the occurrences of each candidate word in each document. This can be done offline.
- Find the documents that contain the query word, and for each of them:
  (1) Let C_i (i = 0, 1, ..., N) be the candidate words in the document. For each occurrence of the query word, calculate the partial similarity with C_i as the inverse of their position distance.
  (2) Sum all the partial similarities for C_i, giving the similarity of C_i in the current document.
- For each candidate word, sum its similarities over all documents, giving the final similarity.
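To make the strategy concrete, a minimal Python sketch of the scoring might look like the following. It is only an illustration of the arithmetic, not plugin code: the function names, the input layout (a list of query-word positions plus `(candidate_id, position)` pairs per document), the sample positions, and the choice to skip a distance of zero are all assumptions made for the example.

```python
from collections import defaultdict


def document_similarities(query_positions, candidate_positions):
    """Score every candidate word inside a single document.

    query_positions: positions of the query word in the document.
    candidate_positions: (candidate_id, position) pairs for the document.
    """
    scores = defaultdict(float)
    for q in query_positions:
        for cand_id, pos in candidate_positions:
            distance = abs(pos - q)
            if distance == 0:
                # Assumption: ignore a candidate located on the query word itself.
                continue
            # Partial similarity is the inverse of the position distance.
            scores[cand_id] += 1.0 / distance
    return dict(scores)


def total_similarities(documents):
    """Sum the per-document scores of each candidate over all documents.

    documents: iterable of (query_positions, candidate_positions) tuples,
    one per document that contains the query word.
    """
    totals = defaultdict(float)
    for query_positions, candidate_positions in documents:
        per_doc = document_similarities(query_positions, candidate_positions)
        for cand_id, score in per_doc.items():
            totals[cand_id] += score
    return dict(totals)


# Two made-up documents; the query word is at position 0 and 5 respectively.
docs = [
    ([0], [(1, 10), (1, 20), (2, 30), (5, 120)]),
    ([5], [(2, 7)]),
]
result = total_similarities(docs)
```

Here candidate 1 ends up with 1/10 + 1/20, candidate 2 with 1/30 + 1/2, and candidate 5 with 1/120.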
I want to implement this as an aggregation plugin. A query might look like this:
// The documents:
PUT /test/sim/1
{
"text": "foo bar for far ...",
"candidates":[
{1, 10}, // The first candidate word appears at position 10.
{1, 20}, // The first candidate word appears at position 20 also.
{2, 30}, // The second candidate word appears at position 30.
{5, 120}, // The fifth candidate word appears at position 120.
]
}
// the query:
GET /test/sim/_search
{
"aggs":{
"similarities":{
"coocurrence_similarity":{
"query-word": "foo"
}
}
}
}
// expected result:
{
{ "candidate": 1, "similarity": 1.2 },
{ "candidate": 2, "similarity": 0.8 },
{ "candidate": 5, "similarity": 0.3 },
}
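For the offline step, the "candidates" pairs stored with each document could be produced along these lines. This is a rough sketch under stated assumptions: `candidate_occurrences` and the `candidate_ids` mapping are hypothetical names, and positions come from plain whitespace tokenization, whereas a real index would use the token positions produced by the field's analyzer.

```python
def candidate_occurrences(text, candidate_ids):
    """Return [candidate_id, position] pairs for candidate words found in text.

    candidate_ids: hypothetical mapping from candidate word to numeric id.
    Positions are 0-based token indexes from whitespace splitting (an assumption).
    """
    pairs = []
    for position, token in enumerate(text.lower().split()):
        cand_id = candidate_ids.get(token)
        if cand_id is not None:
            pairs.append([cand_id, position])
    return pairs


# Made-up candidate list and text, for illustration only.
candidate_ids = {"bar": 1, "far": 2}
pairs = candidate_occurrences("foo bar for far bar", candidate_ids)
# pairs == [[1, 1], [2, 3], [1, 4]]
```

Storing each occurrence as a two-element array keeps the field valid JSON.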
My questions:
- How can I get the occurrences (positions) of the query word in each document?
- How can I access the "candidates" array as structured data in each document?
Thank you very much!