Is it possible to create a percolator-style procedure that takes each document from index A and runs a full-text similarity query against the documents in index B, retrieving only the k nearest-neighbour documents from B? Each document in A would then be classified by a 'majority' vote over the k documents retrieved from index B.
Hi Martijn! Do you think it would be possible to write a script that, whenever a new document is indexed, uses the more_like_this query to retrieve its k nearest neighbours and classifies it that way? I'm new to Elasticsearch and my programming skills aren't very advanced, as my background is in statistics...
Yes, that is possible. Just make sure you refresh the index before you run your script; otherwise the newly indexed document won't be visible to the search API.
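As a rough sketch of what such a script could look like, the Python below builds a more_like_this search body for one document and takes a majority vote over the returned hits. The index name "index_a", the text field "body", the label field "category" and both helper names are assumptions for illustration, not anything from your setup. Setting min_term_freq and min_doc_freq to 1 is also an assumption that tends to help with short documents.

```python
from collections import Counter


def build_mlt_search(source_index, doc_id, fields, k):
    """Build a search body that asks for the k documents most similar
    to an already indexed document (hypothetical helper)."""
    return {
        "size": k,  # only the k nearest neighbours come back
        "query": {
            "more_like_this": {
                "fields": fields,
                # Refer to the stored document instead of pasting its text.
                "like": [{"_index": source_index, "_id": doc_id}],
                # Assumption: lowered from the defaults for short documents.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def majority_vote(hits, label_field):
    """Return the most frequent label among the hits, or None if no hit
    carries the label. Ties go to the label that appears first, i.e. the
    one belonging to the highest-ranked neighbour."""
    counts = Counter(
        hit["_source"][label_field]
        for hit in hits
        if label_field in hit.get("_source", {})
    )
    if not counts:
        return None
    return max(counts, key=counts.get)


# Intended flow, assuming the official Python client is available as `es`:
#   es.indices.refresh(index="index_a")          # make the new doc searchable
#   body = build_mlt_search("index_a", "42", ["body"], 5)
#   response = es.search(index="index_b", body=body)
#   label = majority_vote(response["hits"]["hits"], "category")
```

The functions themselves do not talk to the cluster, so they can be tried out on hand-made hits before wiring them to a client. The exact form of the search call differs between client versions, so treat the commented lines as a guide only.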
I presume that when you run this free-text query and assess the "majority vote", you are analysing some existing structured classification field, e.g. "tag" or "category", to find the most relevant tag.
You can use aggregations to do this, but I have two tips:
1. Use the 'sampler' aggregation to consider only the top N results from the MLT query.
2. Use the 'significant_terms' aggregation instead of the 'terms' aggregation to get the top N tags. Popular tags like "software" are perhaps less interesting than "search engine"; significant_terms sniffs out these more distinctive classifications.
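To make the two tips concrete, here is a hedged Python sketch that wraps a query in a sampler aggregation with a significant_terms sub-aggregation, then reads the top bucket from the response. The field name "tag", the aggregation names and the helper names are assumptions for illustration. Note that the sampler's shard_size applies per shard, so on an index with several shards the sample can be larger than N in total.

```python
def build_tag_vote_search(query, tag_field, sample_size):
    """Wrap a query (e.g. a more_like_this query) so that only the
    best-matching documents feed a significant_terms aggregation
    (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": query,
        "aggs": {
            "top_neighbours": {
                # Keep only the top-scoring documents on each shard.
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "candidate_tags": {
                        "significant_terms": {"field": tag_field}
                    }
                },
            }
        },
    }


def best_tag(response):
    """Return the key of the first significant_terms bucket, or None
    when the sample produced no significant tag."""
    buckets = response["aggregations"]["top_neighbours"]["candidate_tags"]["buckets"]
    return buckets[0]["key"] if buckets else None


# Example with a hand-written more_like_this query on an assumed "body" field.
mlt_query = {
    "more_like_this": {
        "fields": ["body"],
        "like": "some free text to classify",
        "min_term_freq": 1,
        "min_doc_freq": 1,
    }
}
search_body = build_tag_vote_search(mlt_query, "tag", 10)
```

The significant_terms buckets come back ordered by significance score, so the first bucket is the tag that stands out most in the sample relative to the index as a whole. That is a different criterion from a plain majority count, which the 'terms' aggregation would give you.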