Finding good initial Elasticsearch decay parameters


#1

I'm building a system that tries to improve the detection of duplicate bug reports using Elasticsearch.

Aside from textual similarity, I want to boost the score with a decay function based on the time between the queried bug report and the documents (other bug reports).

I've already performed an initial analysis on my dataset and discovered that relevant bug reports seem to follow a particular time pattern relative to their original bug report.

This shows that bug reports posted closer together in time have a higher chance of being duplicates (relevant) than others. The graph above uses steps of 10 days.

However, using steps of 1 day, it looks like the following:

There is one large peak for the first day, after which the curve becomes more linear. However, I'm not sure whether this is the correct curve to use, or how I could use it while still capturing the importance of the first day.

In addition, I've read that decay functions need four parameters: origin, offset, decay, and scale. As the origin, I believe I must use the date of the queried bug report. However, I'm wondering what the other parameters should be. I'm aware that these have to be found iteratively by trying different settings, but I guess a good initial guess is important.
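To make the parameters concrete, here is a minimal sketch of the three decay shapes as I understand them from the Elasticsearch function_score documentation, so that candidate settings can be compared offline against my day-level histogram. The function name `decay_score` and the use of plain day counts as distances are my own assumptions for illustration. In all three shapes the score is 1 within `offset` of the origin and equals `decay` at a distance of `offset + scale`.

```python
import math

def decay_score(kind, distance_days, scale, offset=0.0, decay=0.5):
    """Hypothetical helper: score in [0, 1] for a document `distance_days` from the origin."""
    # Distances inside the offset are treated as zero, so the score stays at 1.
    d = max(0.0, abs(distance_days) - offset)
    if kind == "gauss":
        # Variance chosen so that the score equals `decay` when d == scale.
        sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-d ** 2 / (2.0 * sigma_sq))
    if kind == "exp":
        # Rate chosen so that the score equals `decay` when d == scale.
        return math.exp(math.log(decay) / scale * d)
    if kind == "linear":
        # Straight line that reaches `decay` at d == scale and 0 at d == s.
        s = scale / (1.0 - decay)
        return max((s - d) / s, 0.0)
    raise ValueError(f"unknown decay kind: {kind}")

if __name__ == "__main__":
    # Placeholder settings, not recommendations: scale of 10 days, offset of 1 day.
    for kind in ("gauss", "exp", "linear"):
        scores = [round(decay_score(kind, day, scale=10, offset=1), 3) for day in (0, 1, 5, 11, 30)]
        print(kind, scores)
```

Plotting these curves next to the observed distribution of duplicates per day might be one way to pick a first guess before tuning inside Elasticsearch.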

My questions are the following:

  • What would be the best-fitting curve for my scenario (gauss/linear/exp)?
  • What would be a good initial parameter setting to test?
  • What would be a good way to incorporate this addition into the score? I've read that multiplying it with the score would weight it too heavily, since this addition should only influence the score a little.
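For the last question, one option I'm considering is a function_score query that adds a small, weighted decay value to the text score instead of multiplying. The sketch below only builds the request body. The field names `description` and `created_at`, the helper name `build_query`, and all numeric values are assumptions for illustration, not tested recommendations.

```python
import json

def build_query(report_text, report_date, scale="30d", offset="1d", decay=0.5, weight=0.2):
    """Hypothetical helper: text match plus an additive, down-weighted time decay."""
    return {
        "query": {
            "function_score": {
                # Textual similarity against an assumed `description` field.
                "query": {"match": {"description": report_text}},
                "functions": [
                    {
                        # Decay on an assumed `created_at` date field,
                        # centred on the date of the queried bug report.
                        "exp": {
                            "created_at": {
                                "origin": report_date,
                                "scale": scale,
                                "offset": offset,
                                "decay": decay,
                            }
                        },
                        # Caps the decay contribution at `weight`.
                        "weight": weight,
                    }
                ],
                # Add the function value to the query score rather than multiplying.
                "boost_mode": "sum",
            }
        }
    }

if __name__ == "__main__":
    print(json.dumps(build_query("app crashes on startup", "2015-06-01"), indent=2))
```

Since text scores are not bounded to a fixed range, how much a given `weight` matters will depend on the typical score magnitude in my index, so this value would still need tuning.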

Thank you.

