IDF per customer, many customers per index - best practices

Dear Elasticsearch Community,

Many sources on the internet recommend putting many customers into one
index. One example is Shay Banon's talk given at Berlin Buzzwords [1]. This
approach has many advantages, and the alternative, one customer per index,
seems like huge over-provisioning. By using aliases (with the "filter"
clause) it's trivial to create a virtual namespace per customer.
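
For illustration, a filtered alias like this can be described with a single `_aliases` action. The sketch below only builds the request body and sends nothing; the index name `customers`, the field `customer_id` and the `customer_<id>` alias naming are assumptions made for the example.

```python
import json


def filtered_alias_action(index, customer_id):
    """Build the body for POST /_aliases that adds one filtered alias.

    Assumes every document carries a `customer_id` field (hypothetical name).
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": "customer_%s" % customer_id,
                    # Searches through the alias only match this customer's docs.
                    "filter": {"term": {"customer_id": customer_id}},
                }
            }
        ]
    }


body = filtered_alias_action("customers", "42")
print(json.dumps(body, indent=2))
```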

There is one thing that worries me a bit, though. As per the documentation [2]:

Inverse document frequency

How often does each term appear in the index? The more often, the less relevant.
Terms that appear in many documents have a lower weight than more
uncommon terms.

It seems that IDF will be calculated over the entire index. It makes sense,
because this is calculated at index time and not at query time. Is this a
problem in the field? Do you know what impact other customers' documents
can have on a single customer's search? Do you have any advice on
optimizing queries for such a use case? Any best practices?

To sum up, I'm a bit concerned about putting many customers in a single
index, because the search ranking may be affected; but the alternative, an
index per customer, is not feasible because of the huge number of customers.
Do you have any hints here?

Thanks,
Igor Kupczyński

[1] https://speakerdeck.com/kimchy/elasticsearch-big-data-search-analytics
[2] Elasticsearch Platform — Find real-time answers at scale | Elastic

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f85af449-842d-4d11-b854-db4fcd6705f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

IDF is calculated per shard. Only with the DFS search types is it
calculated across all shards involved, in an initial scatter phase.
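
To make the difference concrete, here is a small sketch that compares per-shard IDF with the IDF obtained after summing statistics across shards, as a DFS phase would. It assumes Lucene's classic TF/IDF formula, 1 + ln(numDocs / (docFreq + 1)); the shard names and the term statistics are made up for the example.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Lucene's classic TF/IDF formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# Hypothetical per-shard statistics for one term: (docFreq, numDocs).
shards = {"shard_0": (5, 1000), "shard_1": (200, 1000)}

# Default search types: each shard scores with its own local statistics.
local_idf = {name: classic_idf(df, n) for name, (df, n) in shards.items()}

# DFS search types: statistics are summed across shards before scoring.
total_df = sum(df for df, _ in shards.values())
total_docs = sum(n for _, n in shards.values())
global_idf = classic_idf(total_df, total_docs)

for name, value in sorted(local_idf.items()):
    print("%s local idf: %.3f" % (name, value))
print("global idf: %.3f" % global_idf)
```

With these numbers the same term gets a much higher weight on the shard where it is rare than on the shard where it is common, and the global value lies in between.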

If you are concerned about IDF in a single multi-user index with one
filtered alias per user, consider indexing as many docs as possible into
the multi-user index. The more docs overall, the better: this will flatten
out skewed IDF.

Another option is to route each customer to a single shard. This removes
the need for DFS search types to get global IDF, but it does not scale to a
large number of docs per user.
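
Both options are selected with query-string parameters on the search request: `routing` and `search_type=dfs_query_then_fetch`. The sketch below only builds the URL; the host, the index name `customers` and the routing value are assumptions for the example.

```python
from urllib.parse import urlencode


def search_url(base, index, routing=None, dfs=False):
    """Build a _search URL with optional routing and DFS search type."""
    params = {}
    if routing is not None:
        # Must match the routing value used when the docs were indexed.
        params["routing"] = routing
    if dfs:
        params["search_type"] = "dfs_query_then_fetch"
    query = urlencode(params)
    url = "%s/%s/_search" % (base.rstrip("/"), index)
    return url + ("?" + query if query else "")


print(search_url("http://localhost:9200", "customers", routing="42"))
print(search_url("http://localhost:9200", "customers", dfs=True))
```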

If you have customers with very small document sets, and they can evaluate
relevance scores, they can work out the IDF themselves and may notice that
it is misleading or wrong. In that case, to hide this skew effect, you
could group your customers into classes with roughly equal numbers of docs
(for example a "small doc number" customers index, a "medium doc number"
customers index, and a "big doc number" customers index). You could also
try to group customers with the same kind of docs (if possible at all).
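
As a rough sketch of the size-class idea, the function below assigns a customer to one of three shared indexes by document count. The thresholds, index names and customer figures are arbitrary assumptions and would need tuning against real data.

```python
def index_for_customer(doc_count, small_max=10000, medium_max=1000000):
    """Pick a shared index by customer size; thresholds are arbitrary."""
    if doc_count <= small_max:
        return "customers_small"
    if doc_count <= medium_max:
        return "customers_medium"
    return "customers_big"


# Hypothetical customers and their document counts.
customers = {"acme": 800, "globex": 250000, "initech": 40000000}
placement = {name: index_for_customer(n) for name, n in customers.items()}
print(placement)
```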

If you want proficient customers to take advanced control of their
distributed scoring, you would have to create an index per user and offer
DFS search types to them.

Jörg

On Fri, May 30, 2014 at 12:48 PM, Igor Kupczyński puszczyk@gmail.com
wrote:

[...]


Hi Jörg,

Thanks for your quick answer. I was not aware that IDF is calculated per
shard in regular queries, but it makes sense: one more scatter-gather
phase is required for the global stats. I'll probably end up putting
many (if possible similar) customers on a single index to "average out"
the IDF. I do not want to go with an index-per-customer approach because,
as you mentioned, it does not scale.

Cheers,
Igor

On Friday, 30 May 2014 14:48:31 UTC+2, Jörg Prante wrote:

[...]
