I want to fetch documents that are similar to a given input document (similar to KNN). Since I'm dealing only with text fields, I went with the more_like_this query, which does the job for text fields.
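For context, this is roughly the kind of query I mean. It's a minimal sketch in Python that only builds the request body and doesn't talk to a cluster; the index name, document id, and field names are made up for illustration, and `build_mlt_query` is just a hypothetical helper.

```python
import json


def build_mlt_query(index, doc_id, fields, max_query_terms=25):
    """Build a more_like_this search body that uses an already indexed
    document as the 'like' input. All names passed in are hypothetical."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Reference an existing document instead of passing raw text.
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Hypothetical index "articles" with text fields "title" and "body".
mlt_body = build_mlt_query("articles", "42", ["title", "body"])
print(json.dumps(mlt_body, indent=2))
```

The printed JSON is what I send as the body of a search request against the index.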
However, I'm concerned about performance once I have millions of documents indexed in ES. The documentation says that setting term_vector on the fields, so that term vectors are stored at index time, can speed up the analysis.
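This is how I understand the mapping side of that advice. It's a sketch only: it assumes typeless mappings (Elasticsearch 7.x or later), the field names are hypothetical, and `build_mapping` is my own helper, not something from the docs.

```python
import json


def build_mapping(text_fields, term_vector="yes"):
    """Build an index-creation body in which every listed text field
    stores term vectors at index time. Field names are hypothetical."""
    return {
        "mappings": {
            "properties": {
                name: {"type": "text", "term_vector": term_vector}
                for name in text_fields
            }
        }
    }


# "yes" stores only the terms of each field; other documented values
# such as "with_positions_offsets" store more data per document.
mapping_body = build_mapping(["title", "body"])
print(json.dumps(mapping_body, indent=2))
```

Since this is part of the mapping, I assume it has to be in place before the documents are indexed.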
What I don't understand is which type of term vector the documentation refers to in this context, since there are three different types: term information, term statistics, and field statistics.
Term statistics and field statistics compute the frequency of terms with respect to the other documents in the index, so wouldn't these vectors become outdated when I add new documents to the index?
Hence I presume that the more_like_this documentation refers to the term information (which is the information of the terms in one particular document irrespective of the others).
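To make the three kinds of information concrete, here is a sketch of a term vectors API request body as I understand it. The `term_statistics` and `field_statistics` keys are request-time flags on that API; the field names and the `build_termvectors_body` helper are hypothetical.

```python
import json


def build_termvectors_body(fields, term_statistics=False, field_statistics=True):
    """Build a body for the term vectors API of a single document.
    Term information is always returned for the requested fields; the two
    flags control whether index-wide statistics are added on top."""
    return {
        "fields": list(fields),
        "term_statistics": term_statistics,
        "field_statistics": field_statistics,
        # Leave out per-term positions, offsets, and payloads to keep
        # the response small.
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


# Term information only, with both kinds of statistics switched off.
tv_body = build_termvectors_body(
    ["title", "body"], term_statistics=False, field_statistics=False
)
print(json.dumps(tv_body, indent=2))
```

With both flags off, the response should contain only the per-document term information, which is the part I presume gets stored at index time.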
Can anyone tell me whether storing only the term information vector at index time is sufficient to speed up more_like_this?
Also, it would be helpful if there's a performance evaluation report/stats for "more_like_this".