It seems like the index-per-user approach is required here in order to make
the term frequencies accurate. If you used a shared index http://www.elastic.co/guide/en/elasticsearch/guide/current/shared-index.html
or even faked an index per user http://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html,
the TF counts for a field would reflect the index as a whole
(aggregated across the counts for each shard in that index), not just for
that user. If you queried only one user's documents at a time by applying a
filter on a user-id field, the common terms query would probably not return
the results you are expecting.
Am I understanding this correctly?
I think you understand the issue perfectly, yes. cutoff_frequency is applied
per shard, so each shard would need to contain only a single domain (i.e. one
user's documents) for the dynamic stopword detection to really work.
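
To make the per-shard behaviour concrete, here is a minimal sketch of how a common terms query might be sent to a dedicated per-user index, so that cutoff_frequency is evaluated against that user's term statistics only. The index naming scheme ("user-42") and the field name ("body") are assumptions for illustration; the "common" query syntax follows the Elasticsearch query DSL of the era discussed in this thread.

```python
# Sketch: build a common terms query targeting one user's own index.
# Because cutoff_frequency is evaluated per shard, routing the query to a
# per-user index means the high-frequency ("stopword-like") terms are
# determined from that user's documents alone, not the whole corpus.

def common_terms_query(user_id, text, cutoff=0.001):
    """Return (index_name, query_body) for a common terms query.

    Index name and field name are illustrative assumptions.
    """
    index = "user-%s" % user_id  # one index per user (assumed naming scheme)
    body = {
        "query": {
            "common": {
                "body": {
                    "query": text,
                    # Terms occurring in more than 0.1% of this shard's
                    # documents are treated as high-frequency terms.
                    "cutoff_frequency": cutoff,
                }
            }
        }
    }
    return index, body

index, body = common_terms_query(42, "the quick brown fox")
# With an elasticsearch client, this would then be executed along the
# lines of: es.search(index=index, body=body)
```

With a shared index (or a filtered alias faking index-per-user), the same query body would still run, but the frequency cutoff would be computed over every user's documents in each shard, which is exactly the mismatch described above.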