Hi,
I am currently prototyping a graph-like search using filtering. In my model
each document belongs to a publisher and each consumer is interested in a
set of publishers. Queries are always initiated by consumers.
I currently have a single big index, where each document has a publisher_id
numeric field. Our cluster has ~60 nodes, for an index size of ~3TB / 7
billion documents, split into 250 shards.
My graph structure is handled outside ES and for each query I have the search
terms and also the list of publisher ids I want the results from.
The obvious solution is to simply use a filter on the publisher ids. But we
have a few problems with this:
- We are filtering on a rather long list of publisher ids (typically between
  50 and 300), and each query is very different from the previous one,
  resulting in very poor cache leverage. As soon as the filter cache fills
  up, older filters are evicted as new ones are added.
- Because of our "documents by publisher" model it is not possible to use
  routing, and currently all our shards are hit on every search request.
When we disable filtering, we see a 10X+ increase in QPS.
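For reference, the current approach can be sketched roughly as below. This is only an illustration, assuming the "filtered" query form of the query DSL with a "terms" filter (newer Elasticsearch versions express the same thing with a bool query and a filter clause). The text field name `body` and the helper name are hypothetical; only `publisher_id` comes from our actual index.

```python
def build_publisher_filtered_search(search_terms, publisher_ids, text_field="body"):
    """Return a search request body limited to the given publisher ids.

    `text_field` is a hypothetical name for the analyzed text field.
    """
    # Deduplicate and sort so the same set of publishers always
    # serializes to the same filter.
    ids = sorted(set(publisher_ids))
    return {
        "query": {
            "filtered": {
                # Full-text part of the search.
                "query": {"match": {text_field: search_terms}},
                # One terms filter over the whole list (50 to 300 ids in
                # practice). The list differs for nearly every consumer,
                # so a cached filter is rarely reused.
                "filter": {"terms": {"publisher_id": ids}},
            }
        }
    }


body = build_publisher_filtered_search("solar panels", [42, 7, 42, 19])
print(body["query"]["filtered"]["filter"])
```

Since there is no routing value that covers a set of publishers, a body like this has to be sent to all 250 shards.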
Here are a few thoughts on different models I haven't tested yet:
- Denormalized, consumer-centric
  The idea would be to "denormalize" and store/index a duplicate of each
  document for each consumer of that publisher. This would allow filtering on
  a single consumer_id AND routing on the consumer_id. The obvious problem is
  the impact on data growth, and for that reason it does not seem like a
  viable solution.
- Consumer ids array
  Keep a consumer ids array field on each document and filter on the consumer
  id in this array field.
- Nested or parent-child
  Another idea would be to store/index each document once but add a child or
  nested document with the consumer_id per consumer, so every published
  document would have as many nested/child documents as there are consumers
  for it.
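To make the three models above concrete, here is a rough sketch of what a single published document might look like in each of them. Everything here is an assumption for illustration: the field names `consumer_ids` and `consumers`, the `routing`/`source` wrapper, and the helper names are hypothetical and not an Elasticsearch API. Only the nested variant of the third model is shown.

```python
def denormalized_copies(doc, consumer_ids):
    """Model 1: one copy of the document per consumer, routed by consumer_id."""
    return [
        {"routing": str(cid), "source": dict(doc, consumer_id=cid)}
        for cid in consumer_ids
    ]


def with_consumer_array(doc, consumer_ids):
    """Model 2: a single document carrying an array of consumer ids."""
    return dict(doc, consumer_ids=sorted(set(consumer_ids)))


def with_nested_consumers(doc, consumer_ids):
    """Model 3 (nested variant): one nested object per consumer."""
    nested = [{"consumer_id": cid} for cid in sorted(set(consumer_ids))]
    return dict(doc, consumers=nested)


def consumer_term_filter(field, consumer_id):
    """A single-value term filter, as would be used with models 1 and 2."""
    return {"term": {field: consumer_id}}


doc = {"publisher_id": 7, "body": "hello"}
followers = [101, 102, 103]  # hypothetical consumers of publisher 7

copies = denormalized_copies(doc, followers)
array_doc = with_consumer_array(doc, followers)
nested_doc = with_nested_consumers(doc, followers)
print(len(copies), array_doc["consumer_ids"], len(nested_doc["consumers"]))
```

In this sketch the first model produces one stored copy per consumer, while the other two keep a single top-level document and grow only by the size of the consumer list. In all three, the long publisher list is replaced by a filter on one consumer id.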
I haven't prototyped either the consumer ids array or the nested/parent-child
model, and I wonder whether they have the potential to perform better than
my current filtering strategy.
Is there anything I am missing here? Is there any other model/strategy I
should look into for this? Any help/advice/hints appreciated!
Thanks,
Colin