I have a test set of data that includes a search string and the document id of the document that would be of highest relevance for that search string. Is there a specific way I can go about improving my query (currently just a multi_match across multiple fields) so that the most relevant documents return higher in my results?
Right now I'm just randomly picking boosts and cutoff_frequency values and running my test set through queries to see which randomly created query gives me the best result. Is there a more optimal way I could be doing this?
This is a complex topic; some have even written books on it.
What you have is close to what's known as a judgment list: a set of graded documents for each query. There are a number of standard metrics for taking a judgment list plus actual search results and producing a single number that tells you how good the results are, such as precision@k, mean reciprocal rank (MRR), and NDCG.
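Since your test set has exactly one known-best document per query, MRR is a natural fit. A minimal sketch, with function and variable names that are purely illustrative:

```python
def reciprocal_rank(result_ids, relevant_id):
    """1 / rank of the relevant doc, or 0.0 if it wasn't returned."""
    for rank, doc_id in enumerate(result_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results_by_query, judgments):
    """Average reciprocal rank across all queries in the test set.

    results_by_query: {query: [doc_id, ...]} in ranked order
    judgments:        {query: relevant_doc_id}
    """
    scores = [
        reciprocal_rank(results_by_query[q], relevant_id)
        for q, relevant_id in judgments.items()
    ]
    return sum(scores) / len(scores)
```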
On the solution: once you have a metric you can trust, you can run a grid search over the parameters of your current query strategy (boosts, cutoff_frequency, and so on) instead of picking them at random.
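A rough sketch of what that grid search could look like, assuming the official Python client (elasticsearch-py, 8.x-style calls) and the mean_reciprocal_rank helper above; the index name, field names, and boost grid are all placeholders:

```python
from itertools import product

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def run_query(query_text, title_boost, body_boost):
    """Run one candidate multi_match and return the ranked doc ids."""
    resp = es.search(
        index="docs",
        size=10,
        query={
            "multi_match": {
                "query": query_text,
                "fields": [f"title^{title_boost}", f"body^{body_boost}"],
            }
        },
    )
    return [hit["_id"] for hit in resp["hits"]["hits"]]

# Your test set: {search string: id of the most relevant doc}
judgments = {
    "example search": "doc-42",
}

best_score, best_params = 0.0, None
for title_boost, body_boost in product([1, 2, 5, 10], repeat=2):
    results = {q: run_query(q, title_boost, body_boost) for q in judgments}
    score = mean_reciprocal_rank(results, judgments)
    if score > best_score:
        best_score, best_params = score, (title_boost, body_boost)

print(f"best MRR {best_score:.3f} with boosts {best_params}")
```

The same loop works for any parameter you're currently tuning by hand; the point is that the metric, not gut feel, picks the winner.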
BUT you'll only do as well as the quality of the underlying queries allows, just as machine learning is only as good as its underlying features. And that's the hard stuff people spend years on, both inside and outside the search engine, with complex enrichment of documents and queries. What you need to do is craft good ranking-time signals that turn a relevance score into something closer to what users actually care about.
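As one example of a ranking-time signal, you could blend a document popularity value into the text score with a function_score query. A sketch, assuming a numeric "popularity" field exists on your docs (the field name and weights are placeholders):

```python
# Blend a hypothetical numeric "popularity" field into the text score.
popularity_query = {
    "function_score": {
        "query": {
            "multi_match": {
                "query": "user search string",
                "fields": ["title^2", "body"],
            }
        },
        "functions": [
            {
                "field_value_factor": {
                    "field": "popularity",  # assumed to exist on your docs
                    "modifier": "log1p",    # dampen very popular outliers
                    "missing": 0,
                }
            }
        ],
        "boost_mode": "sum",  # add the signal to the text relevance score
    }
}
resp = es.search(index="docs", query=popularity_query)
```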
IF you have good enough signals AND a lot of high-quality judgments, you MIGHT be in a position to turn the ranking optimization into a machine learning problem, i.e. learning to rank.
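To give a flavor of what that means, here is a toy pointwise learning-to-rank sketch with scikit-learn. The feature values and data are entirely made up; real setups log features from the engine itself and usually use pairwise or listwise models rather than a plain classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, doc) pair: [title_bm25, body_bm25, log_popularity]
X = np.array([
    [12.3, 4.1, 2.0],   # relevant doc for query 1
    [8.7,  6.0, 0.5],   # irrelevant doc for query 1
    [15.1, 2.2, 3.1],   # relevant doc for query 2
    [3.4,  1.0, 1.2],   # irrelevant doc for query 2
])
y = np.array([1, 0, 1, 0])  # 1 = relevant per the judgment list

model = LogisticRegression().fit(X, y)

# Rank new candidates for a query by the model's relevance probability.
candidates = np.array([[10.0, 3.0, 1.5], [6.0, 5.5, 2.8]])
scores = model.predict_proba(candidates)[:, 1]
print(scores.argsort()[::-1])  # indices in descending relevance order
```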