I have a couple of questions about the terms lookup filter:
Using the twitter example (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html#_terms_lookup_twitter_example) from the documentation, let's say a user is a "star" on Twitter. He could
have 1 million followers and add thousands daily. Would there be an issue
filtering by that many values? Also, updating the filter document at that
point seems like it would start to get expensive. Is there any built-in
solution to this? I would think one way is to chunk followers into
smaller batches and do multiple filtered queries.
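For reference, the lookup pattern from that page looks roughly like this (the tweets/users names come from the docs' example; exact request bodies may vary by ES version):

# The lookup document: user 2's follower list lives in its own doc
curl -XPUT 'localhost:9200/users/user/2' -d '{
  "followers": ["1", "3"]
}'

# Filter tweets by every value found in that followers array
curl -XGET 'localhost:9200/tweets/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "user": {
            "index": "users",
            "type": "user",
            "id": "2",
            "path": "followers"
          }
        }
      }
    }
  }
}'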
The documentation says: "Also, consider using an index with a single
shard and fully replicated across all nodes if the 'reference' terms data
is not large." I understand the fully-replicated reasoning, but why does
using a single shard matter? Since a get is done, even if there
are multiple shards, Elasticsearch can determine the single shard on which
to do the lookup, right?
I agree with your assessment, and there is no built-in mechanism to optimize
for this. There will be some overhead to filtering so many terms (although
it's difficult to say how much), it will generate more churn on the filter
cache, and updates to that single document may become quite expensive (most
likely due to write contention). Because ES uses optimistic concurrency
control, updates to that doc may have to retry many times before they succeed.
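To make the retry cost concrete, every new follower means a partial update to that one lookup document, so under contention you end up leaning on retry_on_conflict (the doc ID and script below are just a sketch; dynamic scripting must be enabled):

# Append one follower to the shared lookup doc. On a version
# conflict, ES re-fetches the doc and re-applies the script, up
# to 5 times; cheap for one writer, expensive for thousands.
curl -XPOST 'localhost:9200/users/user/2/_update?retry_on_conflict=5' -d '{
  "script": "ctx._source.followers += new_follower",
  "params": { "new_follower": "12345" }
}'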
In this situation, I may actually think about flipping the data around.
Instead of a single document that says "who follows me", I might consider
changing it to a "who I follow" relationship and embedding that into
individual documents. So each user document now maintains a list of "who I
follow". Updates are scattered around multiple documents, reducing
contention. When you need to find "all followers of user X", you can
simply do a "term" : {"user" : "X"} filter across all the user docs.
Same set of results, but now you're only doing a single term filter
instead of a terms filter with millions of values.
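A rough sketch of that flipped model (the "follows" field name is mine, not from the thread):

# Each user doc embeds its outbound "who I follow" list
curl -XPUT 'localhost:9200/users/user/1' -d '{
  "name": "alice",
  "follows": ["42", "7"]
}'

# "All followers of user 42" is now a single term filter over the
# user docs, instead of a terms filter expanding to millions of values
curl -XGET 'localhost:9200/users/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": { "follows": "42" }
      }
    }
  }
}'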
You are correct that ES will go to the relevant shard and ignore the
others when doing a GET.
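Concretely, a GET routes on a hash of the document ID (unless custom routing is set), so the lookup touches exactly one shard:

# shard = hash(routing) % number_of_primary_shards
# routing defaults to the document ID, so this hits a single shard
curl -XGET 'localhost:9200/users/user/2'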
A single shard is used for simplicity and memory reduction. If the
index is fully replicated, all nodes have all the data regardless of how
many primary shards. By using a single shard, you basically just reduce
overhead. Each shard comes with some amount of memory footprint (in the
form of dictionaries, bloom filters, etc.). Five primary shards that are
fully replicated will be no faster (or slower) than one primary shard that
is fully replicated, but they will eat up more memory.
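For what it's worth, "single shard, fully replicated" can be expressed directly in the index settings; auto_expand_replicas keeps a copy on every node as the cluster grows (the index name here is arbitrary):

# One primary shard, replicas auto-expanded to all nodes
curl -XPUT 'localhost:9200/users' -d '{
  "settings": {
    "number_of_shards": 1,
    "auto_expand_replicas": "0-all"
  }
}'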
As I understand it, the example from the documentation tries to demonstrate
searching within all the tweets of a user's followers. It seems like the
data structure change you propose would make it easy to find all the
followers, but wouldn't help when trying to search tweets of all of a
user's followers, correct? For that case, it seems like you would still
need the single document with a list of all followers.
For point 2, your explanation makes perfect sense. Thanks!