Particularly large result sets

Hello list,

I am working on a project where we will be modeling 1-n relations with
potentially hundreds of thousands of endpoints for a single
relationship.

We are currently storing the data in Riak, and this seems to work
fine. But the relations are of course a problem.

Riak links are implemented in a way that limits the number of links on
a single object to a few thousand, so I figured I could use Riak Search
to search for relationship documents of the form "user_id: ...,
follower_id: ..."

However, it seems that Riak Search scales pretty badly when you get
hundreds of thousands of hits. Example: counting the hits when there
are 100,000 of them takes 10 times as long as counting them when there
are 10,000. Not too surprising, perhaps, but it makes the count take
several seconds for big relations.

So I am wondering: can I do this with Elasticsearch, or some other
Lucene-based engine?

So the question I need answered is this: is counting the hits for a
search that matches hundreds of thousands of documents a costly
operation in Elasticsearch? Does it scale linearly with the number of
hits, as it does in Riak Search?

regards,
//Martin

Hi Martin,

The more docs a query matches the slower things get.
The more hits you actually pull from the results (e.g. for displaying
or highlighting purposes) the slower things get.

Otis

If I don't pull all hits, just paginate through the first bit, and count
them, I am basically doing what I do when I google something with millions
of hits. I wonder how Google does it so fast... :expressionless:

Heya,

ES scales very well. Counting 10k, 100k or 1m matching docs will take 100 ms or less.

So if you use standard queries, you will get back, as with Google, the total number of hits and the first 10 relevant docs.

HTH
David :wink:
@dadoonet
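
For readers who want to see what "total number of hits plus the first 10 docs" looks like in practice, here is a minimal sketch in Python. It only builds the JSON body for a search request and reads the total out of a response; it does not talk to a cluster. The function names are made up for illustration, the `user_id` field is taken from the relationship documents described at the top of the thread, and the assumption that it is indexed as an exact-match field is mine.

```python
import json


def build_follower_search(user_id, page_size=10, offset=0):
    """Build a search body asking for one page of hits plus the total count.

    Hypothetical helper: assumes relationship documents carry a `user_id`
    field that can be matched exactly with a term query.
    """
    return {
        "query": {"term": {"user_id": user_id}},
        "from": offset,      # how many hits to skip
        "size": page_size,   # how many hits to actually fetch
        # Recent Elasticsearch versions (7.x and later) stop counting at
        # 10,000 matches by default; this asks for the exact total.
        # Drop this key on versions that predate it.
        "track_total_hits": True,
    }


def total_hits(response):
    """Read the total hit count from a parsed search response.

    Older versions return `hits.total` as a plain integer; 7.x and later
    return an object of the form {"value": N, "relation": "eq"}.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


if __name__ == "__main__":
    # The body would be sent to the index's `_search` endpoint.
    print(json.dumps(build_follower_search("some-user-id"), indent=2))
```

The point of the sketch is that the size of the page you fetch is independent of how many documents match: you ask for 10 hits and still get the total back in the same response. Whether the count stays fast for your data is something to measure on your own cluster.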

Thank you, this is what I wanted to hear :slight_smile: