Term lookup filter

I have a couple of questions about the terms lookup filter:

  1. Using the twitter example
    (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html#_terms_lookup_twitter_example)
    from the documentation, let's say a user is a "star" on twitter. He could
    have 1 million followers and add thousands daily. Would there be an issue
    filtering by that many values? Also, updating the filter document at that
    point seems like it would start to get expensive. Is there any built-in
    solution to this? I would think one way is to try to chunk followers into
    smaller increments and do multiple filtered queries.
  2. The documentation says "Also, consider using an index with a single
    shard and fully replicated across all nodes if the "reference" terms data
    is not large." I understand the fully replicated reasoning, but why does
    using a single shard matter? Since the lookup is done with a GET, even if
    there are multiple shards, Elasticsearch can determine a single shard on
    which to do the lookup, right?
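
The chunking idea from question 1 could look roughly like the sketch below. It is only an illustration: the helper names `chunked` and `build_chunked_queries`, the chunk size, and the follower ids are made up, and the request bodies use the `filtered` query with a plain `terms` filter rather than the lookup form. Each chunk becomes its own query, so the results would still have to be merged on the client side.

```python
from typing import Iterator, List, Sequence


def chunked(values: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `values`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def build_chunked_queries(field: str, values: Sequence[str], size: int) -> List[dict]:
    """Build one filtered query body per chunk, using a plain terms filter."""
    return [
        {"query": {"filtered": {"filter": {"terms": {field: chunk}}}}}
        for chunk in chunked(values, size)
    ]


# Made-up follower ids, purely for illustration.
followers = ["user_%d" % i for i in range(2500)]
queries = build_chunked_queries("user", followers, size=1000)
print(len(queries))  # 3
```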

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5b5a57c2-bdef-4cb0-b03a-d5f5e979c57e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

  1. I agree with your assessment, and there is no mechanism to optimize
    for this. Filtering on so many terms will add some overhead (although
    it's difficult to say how much) and generate more churn on the filter
    cache. Updates to that single document may also become quite expensive,
    most likely because of contention between concurrent writers: ES uses
    optimistic concurrency, so an update to that doc may have to retry many
    times before it succeeds.
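
To make the retry cost concrete, here is a minimal sketch of an optimistic-concurrency update loop. It runs against an in-memory stand-in, not a real Elasticsearch client: `VersionedStore`, `VersionConflict`, `add_follower`, and the `before_write` hook are all hypothetical names invented for the illustration. Every conflict forces a full re-read and re-write of the (potentially huge) followers document.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version number."""


class VersionedStore:
    """In-memory stand-in for a versioned document store (not an ES client)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs.get(doc_id, (0, {"followers": []}))

    def put(self, doc_id, source, expected_version):
        current_version, _ = self.get(doc_id)
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, source)


def add_follower(store, doc_id, follower, max_retries=5, before_write=None):
    """Read-modify-write with retries; returns the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        version, source = store.get(doc_id)
        updated = {"followers": source["followers"] + [follower]}
        if before_write is not None:
            before_write(attempt)  # hook used to simulate a competing writer
        try:
            store.put(doc_id, updated, expected_version=version)
            return attempt
        except VersionConflict:
            continue  # another writer won the race; re-read and try again
    raise VersionConflict("gave up after %d attempts" % max_retries)


store = VersionedStore()


def rival(attempt):
    # A competing writer sneaks in before our first two write attempts.
    if attempt <= 2:
        version, source = store.get("star")
        rival_doc = {"followers": source["followers"] + ["rival_%d" % attempt]}
        store.put("star", rival_doc, version)


attempts = add_follower(store, "star", "new_fan", before_write=rival)
print(attempts)  # 3
print(store.get("star")[1]["followers"])  # ['rival_1', 'rival_2', 'new_fan']
```

With thousands of new followers a day all targeting the same document, the number of attempts per update would be expected to grow. The update API's `retry_on_conflict` parameter automates this kind of loop, but it does not remove the contention itself.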

In this situation, I may actually think about flipping the data around.
Instead of a single document that says "who follows me", I might consider
changing it to a "who I follow" relationship and embedding that into
individual documents. So each user document now maintains a list of "who I
follow". Updates are scattered around multiple documents, reducing
contention. When you need to find "all followers of user X", you can
simply do a "term" : {"user" : "X"} filter across all the user docs.
Same set of results, but now you're only doing a single term filter
instead of a terms filter with millions of values.
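
As a rough illustration of the flipped model, the sketch below emulates both layouts with plain Python dictionaries and checks that they yield the same follower set. The field name `following`, the sample users, and the helper `followers_via_term` are assumptions made up for the example; the filter in the text above is written against a field called "user".

```python
# Hypothetical user documents in the flipped model: each lists who it follows.
user_docs = {
    "alice": {"following": ["star", "bob"]},
    "bob": {"following": ["star"]},
    "carol": {"following": ["alice"]},
    "star": {"following": []},
}

# The original model: one document listing everyone who follows "star".
followers_doc = {"star": {"followers": ["alice", "bob"]}}


def followers_via_term(docs, target):
    """Emulate a single term filter on the (hypothetical) 'following' field."""
    return sorted(uid for uid, doc in docs.items() if target in doc["following"])


flipped = followers_via_term(user_docs, "star")
print(flipped)  # ['alice', 'bob']
print(flipped == sorted(followers_doc["star"]["followers"]))  # True

# Request body for the flipped lookup, written as a filtered query.
term_query = {"query": {"filtered": {"filter": {"term": {"following": "star"}}}}}
```

A new follow now touches only the follower's own small document, so writes no longer pile up on one large one.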

  2. You are correct that ES will go to the relevant shard and ignore the
    others when doing a GET.
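
The routing behind that GET can be sketched as a hash of the routing value (the document id by default) taken modulo the number of primary shards. In the sketch below, `pick_shard` is a hypothetical helper and `zlib.crc32` is only a stand-in for whatever hash function Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document id) to one shard number."""
    # crc32 is a stand-in here; Elasticsearch uses its own hash function.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


# The same id always lands on the same shard, however many shards exist.
for shards in (1, 5):
    print(shards, pick_shard("2", shards))
```

Either way a GET touches exactly one shard, which is why the shard count does not change lookup speed.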

A single shard is used for simplicity and memory reduction. If the
index is fully replicated, all nodes have all the data regardless of how
many primary shards. By making a single shard, you basically just reduce
overhead. Each shard comes with some amount of memory footprint (in the
form of dictionaries, bloom filters, etc). Five primary shards that are
fully replicated will be no faster (or slower) than one primary shard that
is fully replicated, but it will eat up more memory.
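
For reference, a single-shard, fully replicated lookup index might be configured along these lines. This is a sketch rather than a recommended setup: the variable name is made up, and the body is assumed to be sent when the lookup index is created (the documentation's example calls that index "users"). `number_of_shards` and `auto_expand_replicas` are the index settings involved; "0-all" asks for as many replicas as there are other nodes.

```python
import json

# Settings for a small "reference" index: one primary shard, and replicas
# that expand automatically so that every node holds a copy.
lookup_index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "auto_expand_replicas": "0-all",
        }
    }
}

print(json.dumps(lookup_index_settings, indent=2))
```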

On Wednesday, March 12, 2014 9:55:36 AM UTC-4, slushi wrote:

[Quoted text trimmed: the original question, which appears in full at the top of the thread.]


As I understand it, the example from the documentation tries to demonstrate
searching within all the tweets of a user's followers. It seems like the
data structure change you propose would make it easy to find all the
followers, but wouldn't help when trying to search tweets of all of a
user's followers, correct? There it seems like you still need this single
document with a list of all followers.

For point 2, your explanation makes perfect sense. Thanks!

On Wednesday, March 12, 2014 3:23:49 PM UTC-4, Zachary Tong wrote:

[Quoted text trimmed: Zachary Tong's reply and, within it, slushi's original question of Wednesday, March 12, 2014 9:55:36 AM UTC-4, both of which appear in full earlier in the thread.]
