First of all, thank you for taking the time to answer.
I will respond inline with some updates (Colin and I are working together on this).
On Monday, February 11, 2013 11:08:46 AM UTC-5, Clinton Gormley wrote:
Apologies for the late reply, but I wanted to think about this a bit...
I currently have a single big index, where each document has
a publisher_id numeric field. Our cluster has ~60 nodes, for an index
size of ~3TB / 7 billion documents, split into 250 shards
My graph structure is handled outside ES, and for each query I have the
search terms as well as the list of publisher ids I want the results
restricted to.
First thought. What about routing your docs on publisher ID. That at
least should reduce the number of shards you hit on each search.
We think routing on publisher_id would be mostly useless, since the number
of shards is normally lower than the average number of terms in a request's
publisher_id filter, so most searches would still hit every shard anyway.
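To put rough numbers on that: with default hashing, each distinct routing value maps to one shard, so a query filtered on N publishers still has to touch up to min(N, shards) shards. A quick back-of-the-envelope sketch, using the numbers from this thread:

```python
# Sketch: why routing by publisher_id buys little here.
# With default routing, each distinct publisher_id hashes to one shard,
# so a query filtered on N publishers touches up to min(N, num_shards)
# shards (hash collisions can only reduce that bound).

def shards_hit(num_publishers, num_shards):
    """Upper bound on shards a routed multi-publisher query must touch."""
    return min(num_publishers, num_shards)

# Numbers from the thread: 250 shards, 50-300 publisher IDs per query.
print(shards_hit(50, 250))   # -> 50: best case still fans out widely
print(shards_hit(300, 250))  # -> 250: worst case hits every shard
```

So with 50-300 publishers per query against 250 shards, routing only helps at the very low end of the range.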
The obvious solution is to simply use a filter on the publisher ids.
But we have a few problems with this:
- we are filtering on a rather long list of publisher ids (typically
between 50 and 300), and each query is very different from the previous
one, resulting in very poor cache leverage. As soon as the filter cache
fills up, older filters are evicted as new ones are added.
I'm guessing that a single consumer will query the same list of
publisher IDs a few times in reasonably quick succession. What does
your query look like? Are you caching the complete filter for all
publisher IDs for a single consumer? (eg the terms filter will cache
the results for the whole filter by default, rather than the individual
terms)
Yes, it's the same list, and the behavior you described is correct.
We are caching the whole filter; we tried the "or_nocache" execution and
blew the heap a couple of times. So we tried simply setting the cache to
false, but the only difference we saw was an increase in latency (no
effect on QPS). We also tried the "bool" execution mode without any great
results.
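For reference, here is a sketch of the terms-filter variants discussed above, written as plain request bodies in the 0.90-era query DSL. The field name, query, and ID list are placeholders; "execution" and "_cache" are the terms-filter options being compared:

```python
# Hedged sketch of the terms-filter variants from this thread
# (placeholder field names and IDs, 0.90-era query DSL).

publisher_ids = [17, 42, 99]  # in practice: 50-300 IDs per query

def filtered_query(execution, cache):
    """Build a filtered query with the given terms-filter options."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"body": "search terms"}},
                "filter": {
                    "terms": {
                        "publisher_id": publisher_ids,
                        "execution": execution,  # e.g. "plain", "bool", "or_nocache"
                        "_cache": cache,
                    }
                },
            }
        }
    }

# Cached default execution (the setup that thrashes the filter cache):
body_cached = filtered_query("plain", True)
# Uncached bool execution (one of the variants tried above):
body_bool = filtered_query("bool", False)
```

Which combination wins is workload-dependent; with highly distinct per-query ID lists, caching mostly wastes heap, as described above.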
- because of our "documents by publisher" model it is not possible to
use routing and currently all our shards are hit on every search
Why not? If you route by publisher you should be able to.
Same reason as I said above: we would end up hitting all the shards on most
requests anyway, because of the size of the filter.
When we disable filtering, we see a 10x+ increase in QPS.
What about applying the filter after the query (eg you could use the
top-level "filter" param to search, rather than a filtered query). It
may be that your query terms just match 100 docs, in which case there
are fewer docs to filter.
By using the top-level filter on the search, wouldn't we lose scoring,
since filtering produces no score by default? That would be an issue for us.
Moreover, I tested it and didn't see much difference in anything...
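For anyone comparing the two placements, here is a minimal sketch (placeholder field names). Note that in both cases the _score still comes from the query clause, since filters never contribute to scoring; the practical difference is that the top-level "filter" is applied after the query, so facets still see the unfiltered hits:

```python
# Minimal sketch: filter inside a filtered query vs. the top-level
# "filter" param (0.90-era DSL, placeholder field names). Scoring comes
# from the query clause in both cases; filters never produce a _score.

search_terms = {"match": {"body": "search terms"}}
pub_filter = {"terms": {"publisher_id": [17, 42, 99]}}

# Filter applied during query execution (facets see only filtered docs):
filtered_body = {
    "query": {"filtered": {"query": search_terms, "filter": pub_filter}}
}

# Filter applied after the query (facets see unfiltered docs):
post_filtered_body = {"query": search_terms, "filter": pub_filter}
```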
Here are a few thoughts on different models I haven't tested yet:

Denormalization per consumer
The idea would be to "denormalize" and store/index a duplicate of
each document for each consumer of that publisher. This would allow
us to filter on a single consumer_id AND use routing on the consumer_id.
The obvious problem is the impact on data growth, so it does not seem
like a viable solution.
Consumer ids array
Keep a consumer ids array field on each document and filter on a
single consumer id against this array field.
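A minimal sketch of this model (field names are placeholders): each document carries the IDs of the consumers allowed to see it, and a query then needs only a single-term filter instead of the 50-300 term publisher filter:

```python
# Sketch of the "consumer ids array" model (placeholder field names).
# Each document lists every consumer of its publisher:
doc = {
    "publisher_id": 17,
    "body": "published content",
    "consumer_ids": [101, 102, 103],
}

# A search for one consumer becomes a single-term filter rather than a
# 50-300 term publisher filter:
query = {
    "query": {
        "filtered": {
            "query": {"match": {"body": "search terms"}},
            "filter": {"term": {"consumer_ids": 101}},
        }
    }
}
```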
Nested or parent-child
Another idea would be to store/index each document once, but add
a child or nested document with the consumer_id for each consumer, so
every published document would have as many nested/child documents as
there are consumers for it.
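A hedged sketch of the nested variant (placeholder mapping and field names, 0.90-era DSL): one published document, one nested sub-document per consumer, queried with a nested filter on a single consumer_id:

```python
# Sketch of the nested model (placeholder names, 0.90-era DSL).
# Mapping: one nested sub-document per consumer of the publisher.
mapping = {
    "document": {
        "properties": {
            "publisher_id": {"type": "long"},
            "body": {"type": "string"},
            "consumers": {
                "type": "nested",
                "properties": {"consumer_id": {"type": "long"}},
            },
        }
    }
}

# Query: a nested filter matching a single consumer_id.
query = {
    "query": {
        "filtered": {
            "query": {"match": {"body": "search terms"}},
            "filter": {
                "nested": {
                    "path": "consumers",
                    "filter": {"term": {"consumers.consumer_id": 101}},
                }
            },
        }
    }
}
```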
I haven't prototyped either the consumer ids array or the
nested/parent-child model, and I wonder whether they have the potential
to perform better than my current filtering strategy?
All are possibilities, and I'm not sure how tricky they'd be in practice
either. Certainly players like FB tend to have redundant copies of
everything as you suggest. But I'd start by seeing what you can do with
the current structure first.
We are currently indexing with an array of consumer_id, like solution
#2, but the index is about 2-3 times bigger, which was already an
issue even before this...
Would be good to see some real world queries.
Also, what about increasing the size of the filter cache?
We tried doubling the filter cache size, but in our benchmark that just
delayed the point at which filters started being evicted, so basically
it's just buying us some more time.
However, our benchmark generates "random" filters from a pool of users,
so this is kind of a worst-case scenario; we'll probably be able to
tell more once we have real traffic on the cluster...
Thanks again for taking the time to answer us; more details to come, and
more solutions are welcome.
You received this message because you are subscribed to the Google Groups "elasticsearch" group.