Phrase frequency in a document and in the whole collection

Hi there,

I am (re)asking this question, as it has already been asked by other users but has not received a response yet.

The question is how to get the number of times a phrase appears in a specific document and in the whole collection. Here is an example:

Consider the following documents indexed by Elasticsearch:

doc1: "one two three one two"
doc2: "three one two four"

I would like to get the following stats from the index:

phrase_frequency(doc1, "one two") = 2
phrase_frequency(doc2, "one two") = 1
collection_frequency("one two") = 3
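To make the intended semantics concrete, here is a minimal Python sketch (plain Python, no Elasticsearch involved) that computes these counts by sliding a window over the token lists:

```python
def phrase_frequency(doc, phrase):
    """Count occurrences of `phrase` (a token sequence) in `doc` (a token list)."""
    n = len(phrase)
    return sum(1 for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase)

docs = {
    "doc1": "one two three one two".split(),
    "doc2": "three one two four".split(),
}
phrase = "one two".split()

# Per-document phrase frequencies, and their sum over the collection.
per_doc = {name: phrase_frequency(tokens, phrase) for name, tokens in docs.items()}
collection_frequency = sum(per_doc.values())

print(per_doc)               # {'doc1': 2, 'doc2': 1}
print(collection_frequency)  # 3
```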

I know that it has to be done with "span near queries", but I could not find a way to get these stats.

Could someone please provide some help in this regard?

It depends on whether you want this information for human (debugging) or machine consumption. In the former case, you could use the explain API, which will give you the phrase frequency in the explanation string (it is used to compute the score). However, I can't think of a way to get the sum of the phrase frequencies across all documents.
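For reference, the explain request would look something along these lines (index, document id, and field name are made up here, and the exact URL format varies across Elasticsearch versions); the per-document phrase frequency then shows up inside the scoring explanation:

```json
GET /my_index/_explain/doc1
{
  "query": {
    "match_phrase": { "body": "one two" }
  }
}
```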

It is for machine consumption; we need these stats to develop our own scoring model.

These stats can be obtained in Lucene via "span near queries" (even if not in a very elegant way); I would expect to be able to get them in Elasticsearch as well.

Right, phrase queries would work too. If this is at the core of your scoring model, you might want to consider shingles too.
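To illustrate the shingle idea: if word bigrams are indexed as single terms (e.g. via a shingle token filter in the analyzer), then a phrase frequency becomes an ordinary term frequency, which standard term statistics can report. A minimal sketch of the shingling step in plain Python (not the actual Lucene/Elasticsearch filter):

```python
from collections import Counter

def shingles(tokens, size=2):
    """Produce word n-grams ("shingles") joined by a space, in the spirit of Lucene's shingle filter."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

docs = {
    "doc1": "one two three one two".split(),
    "doc2": "three one two four".split(),
}

# The term frequency of the shingle "one two" is exactly the phrase frequency.
freqs = {name: Counter(shingles(tokens)) for name, tokens in docs.items()}
print(freqs["doc1"]["one two"])  # 2
print(freqs["doc2"]["one two"])  # 1
```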

Lucene is a very versatile, low-level library that gives access to lots of information. Elasticsearch, on the other hand, is a higher-level tool: it doesn't aim to expose everything Lucene can do, but rather focuses on common use cases, and unfortunately I don't think this one is frequent enough to warrant inclusion in Elasticsearch.