Architecture and performance question on searching small subsets of documents


(Mike Topper) #1

Hello,

I'm new to Elasticsearch, so this might be a stupid question, but I'd love
some input before I start creating my Elasticsearch cluster.

Basically, I will be indexing documents with a few fields (the documents are
pretty small). There are ~90 million documents in total.

On the search side of things, each search will be limited to the small
subset of documents owned by the user doing the search.

My initial thought was to have one large index for all documents, with a
multi-value field holding the IDs of every user that owns the document.
Then, when searching the index, I would apply a filter to limit results to
that user ID. My only concern is that query times might be slow, because
every search has to filter a large data set down to a very small subset
(on average a user probably owns fewer than 1k documents).
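
For illustration, the single-index approach described above could be expressed as a search body like the one built below. This is only a sketch: the field names `owner_ids` and `body` are hypothetical, and the `bool` query with a `filter` clause is the form used by later Elasticsearch releases (the 1.x releases current at the time of this thread expressed the same idea with a `filtered` query). No client library is used; the code just builds the JSON.

```python
import json


def build_owner_filtered_search(user_id, text,
                                owner_field="owner_ids",  # hypothetical field name
                                text_field="body"):       # hypothetical field name
    """Build a search body that matches `text` only within one user's documents."""
    return {
        "query": {
            "bool": {
                # Scored part of the query: the user's actual search terms.
                "must": [{"match": {text_field: text}}],
                # Non-scored restriction: the document must list this user as an owner.
                "filter": [{"term": {owner_field: user_id}}],
            }
        }
    }


body = build_owner_filtered_search("user-42", "quarterly report")
print(json.dumps(body, indent=2))
```

Because the owner restriction sits in the filter clause, it does not affect scoring and is the kind of clause the engine can cache.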

The other option is to create an index for each user and index their
documents into it, but this would duplicate a massive amount of data and
just seems hacky.

Any suggestions?

Thanks,
Mike

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CALdNedL-sLM%3DyMWsHHzriBmMwfe08mxVG%3D%3D9tSwxLwiWzfAcyw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2

Look at routing. It will help by limiting the searches to the shard with
the user's data. Beyond that, you can generally trust the caching on
filters to make this kind of use case quick. At least that is what I've
seen on the mailing list.
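
The idea behind routing can be sketched roughly as follows. This is a toy model, not Elasticsearch code: `zlib.crc32` stands in for Elasticsearch's own hash function, the shard count and document list are made up, and it assumes each document has a single owner (documents with several owners, as in the question, would need more thought). In Elasticsearch itself the routing value is supplied through the `routing` parameter on index and search requests.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Map a routing key to a shard number (stand-in hash, not the real one)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


# Hypothetical (doc_id, owner) pairs, indexed with the owner as the routing key.
docs = [("doc-1", "alice"), ("doc-2", "bob"), ("doc-3", "alice"), ("doc-4", "carol")]

shards = {n: [] for n in range(NUM_SHARDS)}
for doc_id, owner in docs:
    shards[shard_for(owner)].append(doc_id)


def shards_to_search(owner):
    """With routing, a search for one owner only has to visit one shard."""
    return [shard_for(owner)]


print(shards_to_search("alice"), shards)
```

Since every document with the same routing key lands on the same shard, a search that supplies that key can skip all the other shards.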


(Adrien Grand) #3

Having a single large index is probably the best option, and it will scale
better as your user base grows.

I would recommend watching the following video:
http://www.elasticsearch.org/videos/big-data-search-and-analytics/
The part you are interested in starts at 13:20.



(Michael McCandless) #4

Try the filter approach first, and look into other approaches only if
performance isn't good enough. Lucene is quite fast at intersecting filters
with large postings lists these days.
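
To give a feel for why intersecting a small filter with a large postings list can be cheap, here is a toy sketch. It is not Lucene's implementation; it only shows the general idea of letting the small sorted list drive the intersection and skipping ahead in the large one, here with a binary search. The doc IDs are made up.

```python
from bisect import bisect_left


def intersect_sorted(small, large):
    """Return the doc IDs present in both sorted lists, skipping ahead in `large`."""
    matches = []
    pos = 0
    for doc_id in small:
        # Jump to the first entry in `large` that is >= doc_id,
        # starting from where the previous lookup ended.
        pos = bisect_left(large, doc_id, pos)
        if pos == len(large):
            break  # nothing left in `large` to match against
        if large[pos] == doc_id:
            matches.append(doc_id)
    return matches


user_docs = [3, 250, 9000, 70000]          # hypothetical: docs owned by one user
term_postings = list(range(0, 100000, 5))  # hypothetical: docs matching a common term
print(intersect_sorted(user_docs, term_postings))  # [250, 9000, 70000]
```

The work done is proportional to the size of the small list times the cost of a skip, not to the size of the large list, which is why a user owning around 1k documents out of 90 million is not necessarily a problem.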

A separate index per user is not only wasteful because of the duplicated
content; it will also consume substantially more RAM, disk, and file
descriptors because of the fixed overhead each index requires.

Mike McCandless

http://blog.mikemccandless.com


