Public/Private Index architecture

First of all, greetings everyone,

I work for a company that currently uses Sphinx, and my job is to migrate
from Sphinx to Elasticsearch, since Sphinx can't handle the amount of data
we have.

We currently have about 3 billion documents, growing fast, and an almost
unlimited budget for an Elasticsearch cluster.

We aggregate data from multiple sources, and it is mainly accessed per user.
Some of the data is private and some is public (Twitter, etc.), so we had
the idea of splitting the data into 2 indices (one public and one private).
The idea is that the private index will have a large number of shards, to
overallocate, with routing by user id, since it is always accessed that way.
The public data would go into one big public index, but for that part the
queries might be more painful since there is no routing.
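
To make the idea concrete, here is a rough sketch of what I have in mind, in
Python against the REST API (node address, index names, field names and
shard counts are placeholders, not final values):

    import requests

    ES = "http://localhost:9200"   # placeholder address of one ES node

    # Private index: overallocated shards, documents routed by user_id
    requests.put(ES + "/private_docs", json={
        "settings": {"number_of_shards": 50, "number_of_replicas": 1}})

    # Public index: fewer shards, no routing
    requests.put(ES + "/public_docs", json={
        "settings": {"number_of_shards": 25, "number_of_replicas": 1}})

    # A private document is indexed with routing=user_id
    # (the "doc" type is a placeholder; newer ES versions use /_doc)
    doc = {"user_id": 42, "body": "some private content"}
    requests.post(ES + "/private_docs/doc?routing=42", json=doc)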

This is really my problem... we ideally want response times under 200 ms at
about 1000 QPS, and we will have to hit both indices on almost every query.
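
Almost every query would then be one request that spans both indices,
roughly like this (a simplified sketch with placeholder names; the real
queries carry full-text clauses and more):

    import requests

    ES = "http://localhost:9200"   # placeholder node address

    # One search across both indices, restricted to the users involved
    resp = requests.post(ES + "/private_docs,public_docs/_search", json={
        "query": {"terms": {"user_id": [42, 77, 1234]}},
        "size": 20})
    print(resp.json()["took"], "ms spent server-side")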

So the killer question: I was thinking about 50 nodes (32 GB RAM, 8 cores
each), so maybe 50 shards for the private index and 25 for the public... do
you think our goals can be achieved with that number of nodes and that
amount of data?

I have run some tests with 6 nodes, 1 big index, and 1/10 of our data, with
mixed results (meaning there is more optimization work to do...). Some
responses were very fast and some still took around 1 second (for users with
many relations to other users).

Any suggestions are welcome at this point...

--

Hi,

You mention 50 nodes. Wouldn't it be simpler to have 2 separate clusters
that you can scale independently?
You also say "I was thinking about 50 nodes (32 GB RAM, 8 cores each), so
maybe 50 shards for the private index and 25 for the public". I don't follow
how you jump from 50 nodes/servers to 50+25 shards.

Here is what may help:
10 keep 2 clusters separate (it sounds like they are searched separately)
20 index some amount of data to just 1 node
30 hammer it with N QPS and watch the query latency
40 if there is more room (i.e., latency is < 200 ms at 100 QPS), index some
more
50 GOTO 30

What's N?
Imagine you have 10 nodes available for the cluster you are testing.
1000/10 = 100 QPS on each node -- that's your N.
The above assumes there are no replicas to share the load; divide further by
the number of replicas. A bare-bones load-test sketch is below.
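
For example, something like this driver (node address, index name, query and
worker count are all placeholders you would swap for your own):

    import time, requests, concurrent.futures, statistics

    ES = "http://localhost:9200"   # the single node under test (placeholder)
    QPS = 100                      # N from the calculation above
    DURATION = 60                  # seconds to hammer the node

    def one_query():
        body = {"query": {"term": {"user_id": 42}}, "size": 20}
        start = time.time()
        requests.post(ES + "/test_index/_search?routing=42", json=body)
        return (time.time() - start) * 1000   # latency in ms

    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        end = time.time() + DURATION
        while time.time() < end:
            futures.append(pool.submit(one_query))
            time.sleep(1.0 / QPS)             # crude pacing at ~QPS req/sec
        latencies = sorted(f.result() for f in futures)

    print("median %.0f ms, 95th percentile %.0f ms" % (
        statistics.median(latencies),
        latencies[int(len(latencies) * 0.95)]))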

I didn't describe all details here, but I think you get the idea.
Once you understand how much a single server can handle while still giving
you < 200ms responses, you can extrapolate...

An Elasticsearch monitoring tool will help you understand the behaviour of
your server/cluster while you test.

Otis


--

Thank you very much for your answer.

The thing is, the indices are not searched separately: on each search, both
the public AND the private indices are going to be queried... so I was
wondering whether splitting the data like this is a good way to go, even if
it only reduces the size of each index. Moreover, the 2 indices will have
the same object type (and mapping) and will be queried at the same time. I'm
pretty sure the public-data query will take longer than the private one,
since we can't really use routing there (we can, but it wouldn't change
much: there are so many user_ids per query that pretty much every shard
would end up being hit anyway).
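
For example (placeholder names, just to illustrate the point):

    import requests

    ES = "http://localhost:9200"   # placeholder node address

    # One user: routing confines the search to that user's shard.
    requests.post(ES + "/private_docs/_search?routing=42",
                  json={"query": {"term": {"user_id": 42}}})

    # Many related users: the comma-separated routing values end up covering
    # almost every shard, so routing no longer buys us much.
    users = list(range(1, 60))     # e.g. 59 related user_ids
    requests.post(
        ES + "/private_docs/_search?routing=" + ",".join(map(str, users)),
        json={"query": {"terms": {"user_id": users}}})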

I've roughly gone through the steps you mentioned, but with only 1 big index
(private + public data). Since query time depends so much on how many
relations with other user_ids a user has, I don't know whether splitting the
index would do any good. I'll keep going this way and see if there is more
improvement to be had on this side.

Jerome


--

Another thing: when you say benchmark on one machine, how many shards would
you put on that specific machine? I was using the default (5) with routing.


--

Hello Jérôme,

Just one note: don't get too stressed about guessing the correct number of
shards for your indices. Also, maybe don't try to overallocate shards
upfront, since you may end up with too many shards for your use case.

The most important thing to remember when designing the layout for your data
is that you can have multiple indices, each with its own shard/replica
settings, and tie them together with index aliases.
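
A small sketch of what I mean (the index and alias names are of course made
up):

    import requests

    ES = "http://localhost:9200"   # placeholder node address

    # Two indices with different shard counts...
    requests.put(ES + "/private_2012_10",
                 json={"settings": {"number_of_shards": 20}})
    requests.put(ES + "/public_2012_10",
                 json={"settings": {"number_of_shards": 5}})

    # ...tied together under one alias.
    requests.post(ES + "/_aliases", json={"actions": [
        {"add": {"index": "private_2012_10", "alias": "documents"}},
        {"add": {"index": "public_2012_10",  "alias": "documents"}}]})

    # The client searches a single name and never needs to know how many
    # indices sit behind it.
    requests.post(ES + "/documents/_search",
                  json={"query": {"match_all": {}}})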

@kimchy has talked about possible index-layout strategies in a video, which
I suggest you study. From your description, your data could well be time
based (“growing fast”), so time-based indices, with different retention
periods for different users etc., may be a good fit for you.

Karel


--

I already looked at it, thank you, but is it a good idea to use time-based
indices even if we search through all the indices on every query?

That's my concern, but if you tell me it is not an issue, then it would be
perfect, since that would be the easiest way to scale over time without
reindexing everything: we would just adjust the number of shards of each new
index.

Jerome


--

Jérôme,

Time-based indices (TBI?) are great when some queries search only a subset
of the documents, such as the last 24 h, the last 7 days, or docs between
Jan 1 2001 and March 30 2001.
They are good because such searches are cheaper and because old indices can
be removed efficiently without requiring any change in the search client.
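
In practice that looks roughly like this (made-up monthly index names and a
created_at date field assumed):

    import requests

    ES = "http://localhost:9200"   # placeholder node address

    # A "last 30 days" query only has to touch the newest index or two...
    requests.post(ES + "/docs-2012-09,docs-2012-10/_search", json={
        "query": {"range": {"created_at": {"gte": "now-30d"}}}})

    # ...and retiring old data is a cheap index deletion instead of a huge
    # delete-by-query; nothing changes for a client that searches an alias
    # or a docs-* wildcard.
    requests.delete(ES + "/docs-2012-01")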

Otis


--

That's exactly what I thought... We need to search through all the indexed
data rather than over a time range, so TBI are not exactly what we need...
so we're back to one big freaking index with user_id routing. It works well
when there are not too many users in a query, but as the number of filtered
users grows, latency grows too (which is normal, since we hit more shards).

One thing I'm worried about is that when we need (and we will) to add more
shards to this big index to scale up, we will have to reindex everything...
and we're talking about reindexing billions of documents here.
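
If it comes to that, the copy itself would be something like a
scroll-and-bulk loop into a new index created with more shards (a very rough
sketch with placeholder names; the exact scroll and bulk syntax depends on
the ES version):

    import json, requests

    ES = "http://localhost:9200"            # placeholder node address
    SRC, DST = "big_index", "big_index_v2"  # DST created beforehand, more shards

    # Open a scroll over the whole source index.
    resp = requests.post(ES + "/" + SRC + "/_search?scroll=5m",
                         json={"size": 1000,
                               "query": {"match_all": {}}}).json()

    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        # Re-route every document into the new index by its user_id.
        bulk = ""
        for h in hits:
            meta = {"index": {"_index": DST, "_id": h["_id"],
                              "routing": str(h["_source"]["user_id"])}}
            bulk += json.dumps(meta) + "\n" + json.dumps(h["_source"]) + "\n"
        requests.post(ES + "/_bulk", data=bulk,
                      headers={"Content-Type": "application/x-ndjson"})
        # Fetch the next batch.
        resp = requests.post(ES + "/_search/scroll",
                             json={"scroll": "5m",
                                   "scroll_id": resp["_scroll_id"]}).json()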

I'll probably start working with a 14-node cluster at the beginning of next
week, but first I need to figure out how to index my documents. I'll let you
know how it goes.


--
