Is it possible to create a percolator-type procedure that takes documents from index A and runs a full-text similarity query against the documents in index B, so that only the k nearest-neighbour documents are retrieved from B? The documents in A would then be classified based on a 'majority' vote of the k documents retrieved from index B.
The percolator can only evaluate a single document at a time, so I think that what you would like to do cannot be done with the percolator.
Hi Martijn! Do you think it is possible to write a script so that, whenever a new document is indexed, it uses the more_like_this query to retrieve the k nearest neighbours, and to do the classification that way? I'm new to Elasticsearch and my programming skills aren't that advanced, as my background is in statistics...
Greetings from London
Yes, that is possible. Just make sure that you refresh the index before you run your script; otherwise the newly indexed document isn't visible in the search API.
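As a rough illustration of the script idea, here is a minimal Python sketch. It only builds the more_like_this search body and counts votes over the returned hits; sending the refresh and search requests with a client is left out. The function names, the "category" label field and the min_term_freq / min_doc_freq values are assumptions made for the example, and the "like" parameter is the one used in recent Elasticsearch versions (older releases used "like_text").

```python
from collections import Counter


def build_mlt_query(text, fields, k):
    """Build a search body asking for the k documents most similar to `text`.

    `fields` are the text fields in index B to compare against (hypothetical
    names are fine here; they must match your own mapping).
    """
    return {
        "size": k,  # only the k best-scoring neighbours are returned
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": text,
                # Lowered from the defaults so short documents still match;
                # tune these for your own data.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def majority_label(hits, label_field="category"):
    """Return the most common label among search hits, or None if there is none.

    `hits` is the list found under response["hits"]["hits"].
    """
    labels = [
        hit["_source"][label_field]
        for hit in hits
        if label_field in hit.get("_source", {})
    ]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]
```

The body from build_mlt_query would be sent to the search endpoint of index B, and the hits from the response passed to majority_label to get the class for the document from index A.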
I presume that when you run this free-text query and assess the "majority vote", you are analysing some existing structured classification field, e.g. "tag" or "category", to find the most relevant tag.
You can use aggregations to do this but I have 2 tips:
- use the 'sampler' aggregation to consider only the top N results from the MLT query
- use the 'significant_terms' aggregation instead of the 'terms' aggregation to get the top N tags. Popular tags like "software" are perhaps less interesting than "search engine". Significant terms sniffs out these more important classifications.
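The two tips can be combined in a single request. Below is a minimal Python sketch, under the assumption that the classification lives in a field called "tag"; the aggregation names "top_matches" and "likely_tags" and both function names are made up for the example. Note that the sampler's shard_size applies per shard, so on an index with several shards more than k documents in total may be considered.

```python
def build_vote_request(text, fields, k, tag_field="tag"):
    """Build a search body that votes on tags among the top MLT matches."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
        "aggs": {
            "top_matches": {
                # Restrict the vote to the best-scoring documents per shard.
                "sampler": {"shard_size": k},
                "aggs": {
                    "likely_tags": {
                        # Rank tags by how unusual they are in the sample
                        # compared with the whole index.
                        "significant_terms": {"field": tag_field, "size": 5}
                    }
                },
            }
        },
    }


def top_tag(response):
    """Return the key of the highest-scoring tag bucket, or None if empty."""
    buckets = response["aggregations"]["top_matches"]["likely_tags"]["buckets"]
    if not buckets:
        return None
    # significant_terms returns buckets ordered by significance score.
    return buckets[0]["key"]
```

Swapping "significant_terms" for "terms" in the request would give a plain majority vote by document count instead.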