Phrase frequency in a document and in the whole collection

user3482043 · September 27, 2016, 2:17pm

Hi there,

I am re-asking this question because other users have already asked it, but it has not received an answer yet.

The question is: how can I get the number of times a phrase appears in a specific document and in the whole collection? Here is an example:

Consider the following documents indexed by Elasticsearch:

doc1: "one two three one two"
doc2: "three one two four"

I would like to get the following stats from the index:

phrase_frequency(doc1, "one two") = 2
phrase_frequency(doc2, "one two") = 1
collection_frequency("one two") = 3
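
To make the three quantities concrete, here is a minimal plain-Python sketch of what is being asked for. It does not touch Elasticsearch at all; it assumes simple whitespace tokenization, and the function names simply mirror the notation above.

```python
def phrase_frequency(text, phrase):
    """Count how often `phrase` occurs as a contiguous token sequence in `text`."""
    tokens = text.split()    # assumption: whitespace tokenization
    target = phrase.split()
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of n tokens over the document and compare to the phrase.
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)


def collection_frequency(docs, phrase):
    """Sum the per-document phrase frequencies over the whole collection."""
    return sum(phrase_frequency(text, phrase) for text in docs.values())


docs = {
    "doc1": "one two three one two",
    "doc2": "three one two four",
}

print(phrase_frequency(docs["doc1"], "one two"))  # 2
print(phrase_frequency(docs["doc2"], "one two"))  # 1
print(collection_frequency(docs, "one two"))      # 3
```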

I know that it has to be done with span near queries, but I could not find a way to get these stats.

Could someone please provide some help in this regard?
Thanks!

jpountz · September 27, 2016, 2:50pm

It depends on whether you want this information for human (debugging) or machine consumption. In the former case, you could use explain, which gives you the phrase frequency in the explain string (it is used to compute the score). However, I can't think of a way to get the sum of the phrase frequencies across all documents.
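
For the debugging route, a rough Python sketch of pulling the phrase frequency out of an explain tree could look like the following. It assumes the explanation is a nested object with `description` and `details` keys and that some node's description contains `phraseFreq=<number>`; the exact wording of explain descriptions varies between Elasticsearch/Lucene versions, so the pattern may need adjusting. The `sample` dict and the field name `body` are hand-written stand-ins, not real server output.

```python
import re

# Assumption: the phrase frequency shows up as "phraseFreq=<number>" in a description.
PHRASE_FREQ = re.compile(r"phraseFreq=([0-9.]+)")


def find_phrase_freq(explanation):
    """Depth-first search of an explain tree; returns the first phraseFreq found, else None."""
    match = PHRASE_FREQ.search(explanation.get("description", ""))
    if match:
        return float(match.group(1))
    for child in explanation.get("details", []):
        found = find_phrase_freq(child)
        if found is not None:
            return found
    return None


# Hand-written illustration of the shape of an explanation, not actual output.
sample = {
    "value": 0.5,
    "description": 'weight(body:"one two" in 0), result of:',
    "details": [
        {
            "value": 0.5,
            "description": "score, computed from:",
            "details": [
                {"value": 2.0, "description": "phraseFreq=2.0", "details": []},
            ],
        }
    ],
}

print(find_phrase_freq(sample))  # 2.0
```

Parsing explain strings this way is fragile and slow, which is why it is only reasonable for debugging and not as the basis of a scoring model.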

user3482043 · September 27, 2016, 3:15pm

It is for machine consumption; we need these stats to develop our own scoring model.

These stats can be obtained in Lucene with span near queries (though not in a very elegant way); I would expect to be able to get them in Elasticsearch as well.

jpountz · October 5, 2016, 9:56pm

Right, phrase queries would work too. If this is at the core of your scoring model, you might want to consider shingles too.
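
To illustrate why shingles help here: a shingle is a group of adjacent tokens indexed as a single term, so a two-word phrase becomes one term whose frequency is an ordinary term statistic. The sketch below imitates that in plain Python under the assumption of whitespace tokenization and two-word shingles joined by a space; it is not Elasticsearch code.

```python
from collections import Counter


def shingles(text, size=2):
    """Return the word n-grams (shingles) of `text`, joined by a single space."""
    tokens = text.split()  # assumption: whitespace tokenization
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


docs = {
    "doc1": "one two three one two",
    "doc2": "three one two four",
}

# Per-document shingle counts play the role of per-document term frequencies.
per_doc = {doc_id: Counter(shingles(text)) for doc_id, text in docs.items()}

# Summing over all documents gives the collection-wide frequency of each shingle.
total = sum(per_doc.values(), Counter())

print(per_doc["doc1"]["one two"])  # 2
print(per_doc["doc2"]["one two"])  # 1
print(total["one two"])            # 3
```

In Elasticsearch, the analogous setup would be an analyzer with a `shingle` token filter on the field; the phrase then exists as a term in the index, and its per-document and collection-wide frequencies become regular term statistics. The trade-off is a larger index and a phrase length fixed at index time.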

Lucene is a very versatile, low-level library that gives access to lots of information. Elasticsearch, on the other hand, is a higher-level tool that doesn't aim to do as much as Lucene. It tries to focus on common use cases, and unfortunately I don't think this one is frequent enough to warrant inclusion in Elasticsearch.
