This is a complex topic; entire books have been written on it.
What you have is close to what's known as a judgment list: a set of graded documents for each query. There are several standard metrics for turning a judgment list into a single number that says how good the results are (NDCG, ERR, MAP, precision@k, and so on):
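As a concrete illustration, here is a minimal sketch of one of those metrics, NDCG@k, computed directly from judgment grades (the grade scale of 0-3 is an assumption for the example):

```python
import math

def dcg(grades):
    """Discounted cumulative gain: higher grades near the top count more."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    """NDCG@k: DCG of the actual ordering divided by DCG of the ideal ordering."""
    actual = dcg(ranked_grades[:k])
    ideal = dcg(sorted(ranked_grades, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Judgment grades (0-3) of the documents your engine returned, in rank order:
print(ndcg_at_k([3, 1, 0, 2], k=4))
```

A perfectly ordered result list scores 1.0; anything less tells you how far the ranking is from ideal, averaged over your whole query set.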
You can also use tools built to consume judgment lists and evaluate the quality of a search relevance solution:
On the solution side: if you have metrics you trust, you can run a grid search over the parameters of your current query strategy (field boosts, minimum-should-match, and so on) and keep whichever combination scores best.
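That grid search can be sketched in a few lines. Everything here is hypothetical: `search` stands in for a call to your real engine, `JUDGMENTS` for your real judgment list, and the crude top-k average grade for a proper metric like NDCG:

```python
from itertools import product

# Hypothetical judgment list: grades of the docs each query returns.
JUDGMENTS = {"q1": [3, 2, 1, 0], "q2": [2, 3, 0, 1]}

def search(query, title_boost, phrase_boost):
    """Stub for a real engine call; toy behavior so the sketch runs."""
    grades = JUDGMENTS[query]
    return sorted(grades, reverse=True) if title_boost > phrase_boost else grades

def avg_top_grade(grades, k=3):
    """Crude quality score: mean judgment grade of the top k results."""
    return sum(grades[:k]) / k

grid = {"title_boost": [1, 2, 5], "phrase_boost": [0, 1, 3]}
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda p: sum(avg_top_grade(search(q, **p)) for q in JUDGMENTS),
)
print(best)
```

In practice you would also hold out some queries when scoring, so the winning parameters aren't just overfit to the judgment list you tuned on.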
BUT you'll only do as well as the quality of the underlying queries allows, just as machine learning is only as good as its underlying features. And that's the hard part people spend years on, both inside and outside the search engine, with complex enrichment of documents and queries. What you need to do is craft good ranking-time signals that turn a relevance score into something closer to what users actually care about; see here:
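A hypothetical sketch of what "ranking-time signals" can mean in practice: blending the engine's text relevance score with document-level signals such as popularity and freshness (the signal names, weights, and decay constants below are all illustrative assumptions, not a recipe):

```python
import math

def final_score(text_score, clicks, age_days,
                pop_weight=0.3, recency_weight=0.2):
    """Blend text relevance with popularity and freshness signals."""
    popularity = math.log1p(clicks)          # log dampens runaway popular docs
    freshness = math.exp(-age_days / 365.0)  # decays over roughly a year
    return text_score + pop_weight * popularity + recency_weight * freshness

docs = [
    {"id": "a", "text_score": 2.0, "clicks": 10,   "age_days": 800},
    {"id": "b", "text_score": 1.8, "clicks": 5000, "age_days": 30},
]
ranked = sorted(
    docs,
    key=lambda d: final_score(d["text_score"], d["clicks"], d["age_days"]),
    reverse=True,
)
print([d["id"] for d in ranked])
```

Here the slightly-less-textually-relevant but popular, fresh document outranks the stale one; whether that is the right trade-off is exactly what your judgment list and metrics should tell you.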
IF you have good enough signals AND you have a lot of high-quality judgments, you MIGHT be in a position to turn ranking optimization into a machine learning problem (learning to rank):
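To make that concrete, here is a minimal pointwise learning-to-rank sketch on made-up data: learn weights over your ranking signals so the predicted score approximates the judgment grade. Real setups use dedicated tooling (e.g. LambdaMART-style gradient boosting, or the LTR plugins available for Solr and Elasticsearch) rather than this hand-rolled gradient descent:

```python
# Hypothetical training data: (signal vector, judgment grade) pairs,
# where signals might be [text_score, popularity, freshness].
training = [
    ([2.0, 0.1, 0.9], 3), ([1.5, 0.8, 0.2], 2),
    ([0.5, 0.9, 0.8], 1), ([0.2, 0.1, 0.1], 0),
]

weights = [0.0, 0.0, 0.0]
lr = 0.05
for _ in range(2000):  # plain stochastic gradient descent on squared error
    for features, grade in training:
        pred = sum(w * f for w, f in zip(weights, features))
        err = pred - grade
        weights = [w - lr * err * f for w, f in zip(weights, features)]

def score(features):
    """Rank new documents by the learned combination of signals."""
    return sum(w * f for w, f in zip(weights, features))
```

The pointwise framing is the simplest; pairwise and listwise approaches, which optimize the ordering directly, usually do better but need the same ingredients: good signals and plenty of judgments.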
So I'm not sure whether that helps, other than opening a Pandora's box of things to learn about...