Graph search model and performance/caching problems with filtering

Hi,

I am currently prototyping a graph-like search using filtering. In my model
each document belongs to a publisher and each consumer is interested in a
set of publishers. Queries are always initiated by consumers.

I currently have a single big index, where each document has a publisher_id
numeric field. Our cluster has ~60 nodes, for an index size of ~3TB / 7
billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the search
terms and also the list of publisher ids I want the results from.

The obvious solution is to simply use a filter on the publisher ids. But we
have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically between
    50 and 300) and each query is very different from the previous one,
    resulting in very poor cache leverage. As soon as the filter cache fills
    up, older filters are evicted as new ones are added.

  • because of our "documents by publisher" model it is not possible to use
    routing, and currently all our shards are hit on every search request.

When we disable filtering, we see a 10X+ increase in QPS performance.
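
For reference, a stripped-down sketch of the kind of query I mean (index and
field names other than publisher_id are made up, and the real id list holds
50-300 values that change from query to query):

    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": {
            "terms": { "publisher_id": [101, 202, 303, 404] }
          }
        }
      }
    }'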

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of each document
    for each consumer of that publisher. This would allow us to filter on a single
    consumer_id AND use routing on the consumer_id. The obvious problem is the
    impact on data growth, so it does not seem like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field on each document and filter on a single
    consumer id against this array (see the sketch after this list).

  • Nested or parent-child
    Another idea would be to store/index each document once but add a child or
    nested document with the consumer_id per consumer, so every published
    document would have as many nested/child documents as there are consumers
    for it.
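
A rough sketch of the consumer ids array idea (again, index and field names
are illustrative only):

    # each document carries the ids of the consumers interested in its publisher
    curl -XPUT 'http://localhost:9200/documents/doc/1' -d '{
      "publisher_id": 101,
      "consumer_ids": [7, 42, 1001],
      "body": "document text"
    }'

    # a query for consumer 42 then needs a single, highly reusable term filter
    # instead of a 50-300 term publisher_id filter
    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": { "term": { "consumer_ids": 42 } }
        }
      }
    }'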

I haven't prototyped either the consumer ids array or the nested/parent-child
models yet, and I wonder whether they have the potential to perform better
than my current filtering strategy.

Is there anything I am missing here? Is there any other model/strategy I
should look into for this? Any help/advice/hints appreciated!

Thanks,
Colin


TL;DR

One easy #win would be to get feedback/advice on the best filtering
strategy:

  • currently all my 7B documents have a publisher_id and all queries are
    filtered using a list of publisher_ids. This strategy proves to be really
    slow in our context (10X+ slowdown compared to no filtering), mainly because
    the list of publisher_ids changes from query to query, thus making the filter
    cache very inefficient.

  • would reversing the filtering, by including a list of consumer_ids on
    each document and filtering on only one consumer_id against this list,
    potentially perform better?

  • same reversing idea but using either a nested or parent-child structure to
    add the list of consumer_ids to each document.

These id lists typically contain 50-300 ids.

We are currently rebuilding an index in a test cluster to try to measure
this. We'll follow up with our findings. In the meantime I'll be happy with
any feedback or suggestions!

Colin


Hi Colin

Apologies for the late reply, but I wanted to think about this a bit...

I currently have a single big index, where each document has
a publisher_id numeric field. Our cluster has ~60 nodes, for an index
size of ~3TB / 7 billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the
search terms and also the list of publisher ids I want the results
from.

First thought: what about routing your docs on publisher ID? That at
least should reduce the number of shards you hit on each search.
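
Something along these lines (a sketch, names illustrative):

    # index each doc routed on its publisher_id...
    curl -XPUT 'http://localhost:9200/documents/doc/1?routing=101' -d '{
      "publisher_id": 101,
      "body": "document text"
    }'

    # ...and pass the publisher ids as routing values at search time, so only
    # the shards that hold those publishers are queried
    curl -XGET 'http://localhost:9200/documents/_search?routing=101,202,303' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": { "terms": { "publisher_id": [101, 202, 303] } }
        }
      }
    }'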

The obvious solution is to simply use a filter on the publisher ids.
But we have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically
    between 50 and 300) and each query is very different from the previous
    one, resulting in very poor cache leverage. As soon as the filter cache
    fills up, older filters are evicted as new ones are added.

I'm guessing that a single consumer will query the same list of
publisher IDs a few times in reasonably quick succession. What does
your query look like? Are you caching the complete filter for all
publisher IDs for a single consumer? (eg the terms filter will cache
the results for the whole filter by default, rather than the individual
publisher clauses)
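
To illustrate the difference (filter fragments only, ids made up):

    # default: one cache entry keyed on the whole id list, only reused when
    # exactly the same list comes back
    { "terms": { "publisher_id": [101, 202, 303] } }

    # alternative: an explicit bool of term filters, where each per-publisher
    # term filter is cached individually and can be reused across consumers
    { "bool": { "should": [
        { "term": { "publisher_id": 101 } },
        { "term": { "publisher_id": 202 } },
        { "term": { "publisher_id": 303 } }
    ]}}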

  • because of our "documents by publisher" model it is not possible to
    use routing, and currently all our shards are hit on every search
    request.

Why not? If you route by publisher you should be able to.

When we disable filtering, we see a 10X+ increase in QPS
performance.

What about applying the filter after the query (eg you could use the
top-level "filter" param to search, rather than a filtered query). It
may be that your query terms just match 100 docs, in which case there
are fewer docs to filter.
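
i.e. something like (sketch):

    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query":  { "match": { "body": "some search terms" } },
      "filter": { "terms": { "publisher_id": [101, 202, 303] } }
    }'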

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of
    each document for each consumer of that publisher. This would allow us
    to filter on a single consumer_id AND use routing on the consumer_id.
    The obvious problem is the impact on data growth, so it does not seem
    like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field for each document and filter on
    the consumer id in this field array.

  • Nested or parent-child
    Another idea would be to store/index each document once but add
    a child or nested document with the consumer_id per consumer, so
    every published document would have as many nested/child documents as
    there are consumers for it.

I haven't prototyped either the consumer ids array or the
nested/parent-child models yet, and I wonder whether they have the potential
to perform better than my current filtering strategy.

All are possibilities, and I'm not sure how tricky they'd be in practice
either. Certainly players like FB tend to have redundant copies of
everything as you suggest. But I'd start by seeing what you can do with
the current structure first.

Would be good to see some real world queries.

Also, what about increasing the size of the filter cache?
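
e.g. in elasticsearch.yml on each node (a sketch; pick a value that fits
your heap):

    indices.cache.filter.size: 30%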


First of all, thank you for taking the time to answer.

I will respond inline with some updates (Colin and I are working together on
the project).

On Monday, February 11, 2013 11:08:46 AM UTC-5, Clinton Gormley wrote:

Hi Colin

Apologies for the late reply, but I wanted to think about this a bit...

I currently have a single big index, where each document has
a publisher_id numeric field. Our cluster has ~60 nodes, for an index
size of ~3TB / 7 billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the
search terms and also the list of publisher ids I want the results
from.

First thought: what about routing your docs on publisher ID? That at
least should reduce the number of shards you hit on each search.

We think routing on publisher_id would be of little use, since the number
of shards is smaller than the average number of publisher_id terms in the
filter of a typical request, so most shards would still be hit anyway.

The obvious solution is to simply use a filter on the publisher ids.
But we have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically
    between 50 and 300) and each query is very different from the previous
    one, resulting in very poor cache leverage. As soon as the filter cache
    fills up, older filters are evicted as new ones are added.

I'm guessing that a single consumer will query the same list of
publisher IDs a few times in reasonably quick succession. What does
your query look like? Are you caching the complete filter for all
publisher IDs for a single consumer? (eg the terms filter will cache
the results for the whole filter by default, rather than the individual
publisher clauses)

Yes, it's the same list, and the behavior you described is correct.

We are caching the whole filter; we tried "or_nocache" execution and blew
the heap a couple of times. So we tried simply setting the cache to false,
but the only difference we saw was an increase in latency (no effect on QPS).
We also tried the "bool" execution mode without any great results.
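
For the record, the variants we tried look roughly like this (filter
fragment only, ids illustrative):

    { "terms": { "publisher_id": [101, 202], "execution": "or_nocache" } }
    { "terms": { "publisher_id": [101, 202], "_cache": false } }
    { "terms": { "publisher_id": [101, 202], "execution": "bool" } }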

  • because of our "documents by publisher" model it is not possible to
    use routing, and currently all our shards are hit on every search
    request.

Why not? If you route by publisher you should be able to.

Same reason as above: we would end up hitting all the shards on most
requests anyway, because of the size of the filter.

When we disable filtering, we see a 10X+ increase in QPS
performance.

What about applying the filter after the query (eg you could use the
top-level "filter" param to search, rather than a filtered query). It
may be that your query terms just match 100 docs, in which case there
are fewer docs to filter.

By using the top-level filter on the search, we are going to lose scoring,
no? Filtering has no score by default, so this is an issue for us.

Moreover, I tested it and didn't see much difference in anything...

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of
    each document for each consumer of that publisher. This would allow us
    to filter on a single consumer_id AND use routing on the consumer_id.
    The obvious problem is the impact on data growth, so it does not seem
    like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field for each document and filter on
    the consumer id in this field array.

  • Nested or parent-child
    Another idea would be to store/index each document once but add
    a child or nested document with the consumer_id per consumer, so
    every published document would have as many nested/child documents as
    there are consumers for it.

I haven't prototyped either the consumer ids array or the
nested/parent-child models yet, and I wonder whether they have the potential
to perform better than my current filtering strategy.

All are possibilities, and I'm not sure how tricky they'd be in practice
either. Certainly players like FB tend to have redundant copies of
everything as you suggest. But I'd start by seeing what you can do with
the current structure first.

We are currently indexing with an array of consumer_ids, as in solution
#2, but the index size is about 2-3 times bigger, which was already an
issue without this...

Would be good to see some real world queries.

Also, what about increasing the size of the filter cache?

We tried doubling the filter cache size, but in our benchmark that just
delayed the moment when filters started to be evicted, so basically it's
just buying us some more time.
However, our benchmark generates "random" filters from a pool of users,
so this is kind of a worst-case scenario, and we'll probably be able to
tell more once we have real traffic on the cluster.

Thanks again for taking the time to answer us. More details to come, and
more solutions are welcome.
