Hi
Context
We have a B2B product and would like to index different types of entities (e.g. User, Training, Group). We were advised that the best approach is to use one engine per entity type.
Problem/Question
After reading this article that described how documents are scored, I had the following questions:
- By keeping entities in separate engines, are we putting ourselves at a disadvantage, because inverse document frequencies will be derived from each entity's corpus alone rather than from the whole corpus (all entities combined)?
At the same time, given that our product is B2B and our clients come from all sorts of industries:
- is it correct to assume that we wouldn't want the rarity/inverse document frequency of each term to be based on its rarity across all of our clients' data, but only within the documents of a given client (1 client = 1 corpus)?
- if the above is indeed a problem, would the only solution be to have one engine per client? (I worry that, with thousands of clients, this would make the solution much more complex.)
Perhaps there’s a way to tell ES to calculate IDFs based on a subset of the docs in an index?
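To make the concern concrete, here is a minimal sketch of what I mean. It assumes the BM25-style IDF formula that, as far as I understand, Lucene uses by default: ln(1 + (N - n + 0.5) / (n + 0.5)). The client names, the term "compliance" and all document counts are made up for illustration.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).

    doc_count (N) is the number of documents in the corpus,
    doc_freq (n) is the number of documents containing the term.
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical clients: (total documents, documents containing "compliance")
clients = {
    "bank": (1000, 800),   # the term is very common for this client
    "bakery": (1000, 5),   # the term is rare for this client
}

# IDF if each client were its own corpus
per_client_idf = {
    name: bm25_idf(n_docs, df) for name, (n_docs, df) in clients.items()
}

# IDF if all clients' documents shared one corpus
total_docs = sum(n_docs for n_docs, _ in clients.values())
total_df = sum(df for _, df in clients.values())
shared_idf = bm25_idf(total_docs, total_df)

for name, idf in per_client_idf.items():
    print(f"{name}: per-client IDF = {idf:.3f}")
print(f"shared corpus IDF = {shared_idf:.3f}")
```

With a shared corpus, both clients get the same in-between value for the term, even though it is common for one client and rare for the other.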
Thanks in advance.