Index-per-user required for common terms query and cutoff_frequency?


(Loren Siebert) #1

The docs
(http://www.elastic.co/guide/en/elasticsearch/guide/current/common-terms.html)
mention that "One of the benefits of cutoff_frequency is that you get
domain-specific stopwords for free."

It seems like the index-per-user approach is required here in order to make
the term frequencies accurate. If you used a shared index
(http://www.elastic.co/guide/en/elasticsearch/guide/current/shared-index.html)
or even faked an index per user
(http://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html),
your TF counts for some field would reflect the index as a whole
(aggregated across the counts for each shard in that index), not just for
that user. If you tended to just query the documents for one user at a time
using some filter field, the common terms query would probably not return
the results you are expecting.
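
For concreteness, the kind of request described above might look like the
sketch below. It assumes the Elasticsearch 1.x query DSL (a "filtered" query
wrapping a "common" query); the field names "body" and "user_id", the helper
name, and the cutoff value are made up for illustration.

```python
import json


def build_user_scoped_common_query(user_id, text, cutoff_frequency=0.01):
    """Build a request body: common terms query restricted to one user.

    "body" and "user_id" are hypothetical field names.
    """
    return {
        "query": {
            "filtered": {
                # Common terms query on a hypothetical "body" field.
                "query": {
                    "common": {
                        "body": {
                            "query": text,
                            "cutoff_frequency": cutoff_frequency,
                        }
                    }
                },
                # The filter limits which documents can match. It does not
                # give the query a per-user view of term statistics.
                "filter": {"term": {"user_id": user_id}},
            }
        }
    }


request_body = build_user_scoped_common_query("user-42", "the quick brown fox")
print(json.dumps(request_body, indent=2))
```

The filter only restricts the hits; the concern in this post is about which
documents feed the frequency statistics behind cutoff_frequency.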

Am I understanding this correctly?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/398cfc81-ba3e-458c-840f-aee5e94902c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2

On Wed, Apr 29, 2015 at 2:53 PM, Loren loren@siebert.org wrote:

Am I understanding this correctly?

I think you understand the issue perfectly, yes. cutoff_frequency is
evaluated per shard, so each shard would need to contain only a single
domain for the stopwords to really work.
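
A toy illustration of why that matters, in plain Python rather than
Elasticsearch itself. It assumes a simplified rule (a term counts as
"high frequency" when the fraction of documents containing it exceeds the
cutoff); the real check lives in Lucene and differs in detail. The users,
documents, and cutoff value are invented.

```python
def doc_freq_ratio(docs, term):
    """Fraction of docs whose text contains the term at least once."""
    if not docs:
        return 0.0
    hits = sum(1 for d in docs if term in d["text"].split())
    return hits / len(docs)


def is_high_frequency(docs, term, cutoff_frequency):
    """Simplified stand-in for the cutoff_frequency decision."""
    return doc_freq_ratio(docs, term) > cutoff_frequency


# One shard shared by two hypothetical users.
shard = (
    [{"user": "alice", "text": "invoice overdue reminder"} for _ in range(5)]
    + [{"user": "bob", "text": "holiday photo album"} for _ in range(95)]
)
cutoff = 0.1

# What a shared shard sees: "invoice" is in 5 of 100 docs, so it looks rare.
shard_wide = is_high_frequency(shard, "invoice", cutoff)

# What an index (shard) holding only alice's docs would see: 5 of 5 docs.
alice_docs = [d for d in shard if d["user"] == "alice"]
alice_only = is_high_frequency(alice_docs, "invoice", cutoff)

print("shard-wide:", shard_wide)
print("alice-only:", alice_only)
```

In this model "invoice" is effectively a stopword within alice's documents,
but the shared shard never treats it as one, because bob's documents dilute
its document frequency.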

Nik


