Graph search model and performance/caching problems with filtering

Hi,

I am currently prototyping a graph-like search using filtering. In my model
each document belongs to a publisher and each consumer is interested in a
set of publishers. Queries are always initiated by consumers.

I currently have a single big index, where each document has a publisher_id
numeric field. Our cluster has ~60 nodes, for an index size of ~3TB / 7
billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the search
terms and also the list of publisher ids I want the results from.

The obvious solution is to simply use a filter on the publisher ids. But we
have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically between
    50 and 300) and each query is very different from the previous one,
    resulting in very poor cache leverage. As soon as the filter cache fills
    up, older filters are evicted as new ones are added.

  • because of our "documents by publisher" model it is not possible to use
    routing, and currently all our shards are hit on every search request.

When we disable filtering, we see a 10X+ increase in QPS performance.
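
For reference, a stripped-down sketch of the kind of query I mean (index and
field names other than publisher_id are made up, and the real id list holds
50-300 values that change from query to query):

    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": {
            "terms": { "publisher_id": [101, 202, 303, 404] }
          }
        }
      }
    }'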

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of each document
    for each consumer of that publisher. This would allow us to filter on a single
    consumer_id AND use routing on the consumer_id. The obvious problem is the
    impact on data growth, so it does not seem like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field on each document and filter on a single
    consumer id against this array (see the sketch after this list).

  • Nested or parent-child
    Another idea would be to store/index each document once but add a child or
    nested document with the consumer_id per consumer, so every published
    document would have as many nested/child documents as there are consumers
    for it.
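
A rough sketch of the consumer ids array idea (again, index and field names
are illustrative only):

    # each document carries the ids of the consumers interested in its publisher
    curl -XPUT 'http://localhost:9200/documents/doc/1' -d '{
      "publisher_id": 101,
      "consumer_ids": [7, 42, 1001],
      "body": "document text"
    }'

    # a query for consumer 42 then needs a single, highly reusable term filter
    # instead of a 50-300 term publisher_id filter
    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": { "term": { "consumer_ids": 42 } }
        }
      }
    }'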

I haven't prototyped either the consumer ids array or the nested/parent-child
models yet, and I wonder whether they have the potential to perform better
than my current filtering strategy.

Is there anything I am missing here? Is there any other model/strategy I
should look into for this? Any help/advice/hints appreciated!

Thanks,
Colin


TL;DR

One easy #win would be to get feedback/advice on the best filtering
strategy:

  • currently all my 7B documents have a publisher_id and all queries are
    filtered using a list of publisher_ids. This strategy proves to be really
    slow in our context (10X+ slowdown compared to no filtering), mainly because
    the list of publisher_ids changes from query to query, thus making the filter
    cache very inefficient.

  • would reversing the filtering, by including a list of consumer_ids on
    each document and filtering on only one consumer_id against this list,
    potentially perform better?

  • same reversing idea but using either a nested or parent-child structure to
    add the list of consumer_ids to each document.

These id lists typically contain 50-300 ids.

We are currently rebuilding an index in a test cluster to try to measure
this. We'll follow up with our findings. In the meantime I'll be happy with
any feedback or suggestions!

Colin


Hi Colin

Apologies for the late reply, but I wanted to think about this a bit...

I currently have a single big index, where each document has
a publisher_id numeric field. Our cluster has ~60 nodes, for an index
size of ~3TB / 7 billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the
search terms and also the list of publisher ids I want the results
from.

First thought: what about routing your docs on publisher ID? That at
least should reduce the number of shards you hit on each search.
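
Something along these lines (a sketch, names illustrative):

    # index each doc routed on its publisher_id...
    curl -XPUT 'http://localhost:9200/documents/doc/1?routing=101' -d '{
      "publisher_id": 101,
      "body": "document text"
    }'

    # ...and pass the publisher ids as routing values at search time, so only
    # the shards that hold those publishers are queried
    curl -XGET 'http://localhost:9200/documents/_search?routing=101,202,303' -d '{
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": { "terms": { "publisher_id": [101, 202, 303] } }
        }
      }
    }'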

The obvious solution is to simply use a filter on the publisher ids.
But we have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically
    between 50 and 300) and each query is very different from the previous
    one, resulting in very poor cache leverage. As soon as the filter cache
    fills up, older filters are evicted as new ones are added.

I'm guessing that a single consumer will query the same list of
publisher IDs a few times in reasonably quick succession. What does
your query look like? Are you caching the complete filter for all
publisher IDs for a single consumer? (eg the terms filter will cache
the results for the whole filter by default, rather than the individual
publisher clauses)
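
To illustrate the difference (filter fragments only, ids made up):

    # default: one cache entry keyed on the whole id list, only reused when
    # exactly the same list comes back
    { "terms": { "publisher_id": [101, 202, 303] } }

    # alternative: an explicit bool of term filters, where each per-publisher
    # term filter is cached individually and can be reused across consumers
    { "bool": { "should": [
        { "term": { "publisher_id": 101 } },
        { "term": { "publisher_id": 202 } },
        { "term": { "publisher_id": 303 } }
    ]}}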

  • because of our "documents by publisher" model it is not possible to
    use routing, and currently all our shards are hit on every search
    request.

Why not? If you route by publisher you should be able to.

When we disable filtering, we see a 10X+ increase in QPS
performance.

What about applying the filter after the query (eg you could use the
top-level "filter" param to search, rather than a filtered query). It
may be that your query terms just match 100 docs, in which case there
are fewer docs to filter.
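
i.e. something like (sketch):

    curl -XGET 'http://localhost:9200/documents/_search' -d '{
      "query":  { "match": { "body": "some search terms" } },
      "filter": { "terms": { "publisher_id": [101, 202, 303] } }
    }'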

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of
    each document for each consumer of that publisher. This would allow us
    to filter on a single consumer_id AND use routing on the consumer_id.
    The obvious problem is the impact on data growth, so it does not seem
    like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field for each document and filter on
    the consumer id in this field array.

  • Nested or parent-child
    Another idea would be to store/index each document once but add
    a child or nested document with the consumer_id per consumer, so
    every published document would have as many nested/child documents as
    there are consumers for it.

I haven't prototyped either the consumer ids array or the
nested/parent-child models yet, and I wonder whether they have the potential
to perform better than my current filtering strategy.

All are possibilities, and I'm not sure how tricky they'd be in practice
either. Certainly players like FB tend to have redundant copies of
everything as you suggest. But I'd start by seeing what you can do with
the current structure first.

Would be good to see some real world queries.

Also, what about increasing the size of the filter cache?
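
e.g. in elasticsearch.yml on each node (a sketch; pick a value that fits
your heap):

    indices.cache.filter.size: 30%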


First of all, thank you for taking the time to answer.

I will respond inline with some updates (Colin and I are working together on
the project).

On Monday, February 11, 2013 11:08:46 AM UTC-5, Clinton Gormley wrote:

Hi Colin

Apologies for the late reply, but I wanted to think about this a bit...

I currently have a single big index, where each document has
a publisher_id numeric field. Our cluster has ~60 nodes, for an index
size of ~3TB / 7 billion documents, split into 250 shards.

My graph structure is handled outside ES and for each query I have the
search terms and also the list of publisher ids I want the results
from.

First thought: what about routing your docs on publisher ID? That at
least should reduce the number of shards you hit on each search.

We think routing on publisher_id would be of little use, since the number
of shards is smaller than the average number of publisher_id terms in the
filter of a typical request, so most shards would still be hit anyway.

The obvious solution is to simply use a filter on the publisher ids.
But we have a few problems with this:

  • we are filtering on a rather long list of publisher ids (typically
    between 50 and 300) and each query is very different from the previous
    one, resulting in very poor cache leverage. As soon as the filter cache
    fills up, older filters are evicted as new ones are added.

I'm guessing that a single consumer will query the same list of
publisher IDs a few times in reasonably quick succession. What does
your query look like? Are you caching the complete filter for all
publisher IDs for a single consumer? (eg the terms filter will cache
the results for the whole filter by default, rather than the individual
publisher clauses)

Yes, it's the same list, and the behavior you described is correct.

We are caching the whole filter; we tried "or_nocache" execution and blew
the heap a couple of times. So we tried simply setting the cache to false,
but the only difference we saw was an increase in latency (no effect on QPS).
We also tried the "bool" execution mode without any great results.
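
For the record, the variants we tried look roughly like this (filter
fragment only, ids illustrative):

    { "terms": { "publisher_id": [101, 202], "execution": "or_nocache" } }
    { "terms": { "publisher_id": [101, 202], "_cache": false } }
    { "terms": { "publisher_id": [101, 202], "execution": "bool" } }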

  • because of our "documents by publisher" model it is not possible to
    use routing, and currently all our shards are hit on every search
    request.

Why not? If you route by publisher you should be able to.

Same reason as above: we would end up hitting all the shards on most
requests anyway, because of the size of the filter.

When we disable filtering, we see a 10X+ increase in QPS
performance.

What about applying the filter after the query (eg you could use the
top-level "filter" param to search, rather than a filtered query). It
may be that your query terms just match 100 docs, in which case there
are fewer docs to filter.

By using the top-level filter on the search, we are going to lose scoring,
no? Filtering has no score by default, so this is an issue for us.

Moreover, I tested it and didn't see much difference in anything...

Here are a few thoughts on different models I haven't tested yet:

  • Denormalized, consumer-centric
    The idea would be to "denormalize" and store/index a duplicate of
    each document for each consumer of that publisher. This would allow us
    to filter on a single consumer_id AND use routing on the consumer_id.
    The obvious problem is the impact on data growth, so it does not seem
    like a viable solution.

  • Consumer ids array
    Keep a consumer ids array field for each document and filter on
    the consumer id in this field array.

  • Nested or parent-child
    Another idea would be to store/index each document once but add
    a child or nested document with the consumer_id per consumer, so
    every published document would have as many nested/child documents as
    there are consumers for it.

I haven't prototyped either the consumer ids array or the
nested/parent-child models yet, and I wonder whether they have the potential
to perform better than my current filtering strategy.

All are possibilities, and I'm not sure how tricky they'd be in practice
either. Certainly players like FB tend to have redundant copies of
everything as you suggest. But I'd start by seeing what you can do with
the current structure first.

We are currently indexing with an array of consumer_ids, as in solution
#2, but the index size is about 2-3 times bigger, which was already an
issue without this...

Would be good to see some real world queries.

Also, what about increasing the size of the filter cache?

We tried doubling the filter cache size, but in our benchmark that just
delayed the moment when filters started to be evicted, so basically it's
just buying us some more time.
However, our benchmark generates "random" filters from a pool of users,
so this is kind of a worst-case scenario, and we'll probably be able to
tell more once we have real traffic on the cluster.

Thanks again for taking the time to answer us. More details to come, and
more solutions are welcome.
