Given the removal of mapping types in Elasticsearch 6.0, we are looking to use one index per document type. Following "The Definitive Guide", these will be shared indices with per-client routing, so smaller clients aren't spread across a large number of shards. We're wondering what the implications are for inverse document frequency scoring when clients that happen to reside on the same shard have dramatically different document counts. For example…
Say we have a client named Red and a client named Blue that happen to reside on the same shard in each type of index due to routing. We have one index for letters and one for email. Red has a relatively small number of both letters and emails. Blue has a relatively large number of letters but doesn't use the email feature, so it has only a small number of emails. Given that inverse document frequency is based on all documents within a shard (ignoring routing and filtering), if a user for Red searches for the word "Blue", the number of letters within the shard containing "Blue" will be substantial. This will drag the scores of Red's letters down, pushing email to the top of Red's results purely because of Blue's data.
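To make the concern concrete, here is a rough sketch of the arithmetic as I understand it. It assumes the default BM25 similarity and its IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents in the shard and n is the number containing the term. All of the document counts are made up for illustration, and the function name is my own.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF as used by BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-shard statistics for the term "blue".
# Letters shard: Red has 500 letters (10 mention "blue"),
# Blue has 100,000 letters (40,000 mention "blue").
idf_letters_shared = bm25_idf(doc_count=100_500, doc_freq=40_010)

# Email shard: both clients have few emails, and the term is rare.
idf_email_shared = bm25_idf(doc_count=1_000, doc_freq=12)

# What the letters IDF would be if only Red's own documents were counted.
idf_letters_red_only = bm25_idf(doc_count=500, doc_freq=10)

print(f"letters, shared shard: {idf_letters_shared:.3f}")
print(f"email, shared shard:   {idf_email_shared:.3f}")
print(f"letters, Red only:     {idf_letters_red_only:.3f}")
```

With these invented numbers, the IDF for "blue" in the shared letters shard comes out far lower than in the email shard, and far lower than it would be if only Red's letters were counted, which is the skew I'm worried about.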
Am I understanding this correctly, and is this a substantial problem to worry about? Is there a way to mitigate the problem so clients don't dramatically affect each other's results? For larger clients, we intend to host them on dedicated indices, but there are still small clients that have dramatically different numbers of records relative to each other.
Thank you for any assistance you can offer!!