On 9 December 2014 at 14:36, Ron Sher <ron....@gmail.com> wrote:

Hi,

We have a multi-tenant SaaS application in which we keep data for all of
our clients' accounts (300 clients, which we call services).
We keep data in monthly indices that have grown to about 700 GB and 4.6
billion documents each month.
Each day we index a new account for each service.
Each index has 6 shards, and we use 1 replica.
We're starting to have second thoughts about the structure of our indices
and about the proper use of routing.
A few things we are contemplating:
- Use routing by service, so that we will probably benefit more from
caching (sketched below).
- Split the indices by service + month, so that each query scans much less
data, but at the cost of many more indices (instead of 12 indices a year
we would have 300 x 12, and growing as the number of clients grows).
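As an illustration of the routing idea, here is a minimal sketch using the
official Python client (elasticsearch-py); the index name, doc type, field
names, and service id are all hypothetical:

# Minimal sketch of per-service routing with elasticsearch-py.
# Index name, doc type, and service id below are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

service_id = "service-42"  # hypothetical tenant/service identifier

# Route each document by service so that all documents for one
# service land on the same shard of the monthly index.
es.index(
    index="events-2014.12",
    doc_type="event",
    body={"service_id": service_id, "message": "example"},
    routing=service_id,
)

# A search that passes the same routing value hits only that one
# shard instead of fanning out to every shard of the index.
res = es.search(
    index="events-2014.12",
    body={"query": {"term": {"service_id": service_id}}},
    routing=service_id,
)
print(res["hits"]["total"])

Note that several services can hash to the same shard, so routed queries
should still filter on the service field, as above.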
On Tuesday, December 9, 2014 3:56:06 PM UTC+1, Mark Walkom wrote:

Currently you have shards of over 100 GB each, which is massive and
probably causing you some issues. Ideally you should be aiming for a max
shard size of 40-50 GB, so increasing your shard count to 24 brings you
under that level and also gives you room for growth at the index level.
Having a higher shard count also spreads the query load, and reduces the
amount of thrashing (i.e. data transfer) if/when a node goes down.
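As a sketch of Mark's suggestion, the shard count for future monthly
indices could be raised with an index template (the template name and
index-name pattern here are hypothetical):

# Sketch: raise the shard count for future monthly indices via a template.
# The template name and index pattern are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# The settings apply to any index whose name matches the pattern at
# creation time, so next month's index is created with 24 shards.
es.indices.put_template(
    name="monthly-events",
    body={
        "template": "events-*",
        "settings": {
            "number_of_shards": 24,   # ~700 GB / 24 ~= 29 GB per primary
            "number_of_replicas": 1,
        },
    },
)

Keep in mind that the shard count of an existing index cannot be changed,
so this only affects newly created indices; older months would have to be
reindexed.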
On Tue, Dec 9, 2014 at 4:37 PM, Mark Walkom <markw...@gmail.com> wrote:

How many servers are in this cluster?

On 9 December 2014 at 15:50, Ron Sher <ron....@gmail.com> wrote:

We have 24 data nodes, 3 master nodes and 3 client nodes. We use
m3.4xlarge for the data nodes.
On Tuesday, December 9, 2014 5:52:55 PM UTC+2, Jilles van Gurp wrote:
Indeed, increase your shard count. Also, you may want to consider using a
routing parameter based on e.g. a tenant_id to ensure that all queries
related to a tenant only hit shards that actually have data for that
tenant. Those two measures would reduce the size of each shard and the
number of shards involved for each tenant. To increase query capacity, you
could consider increasing the number of replicas as well; this way, you
have more nodes that can handle query traffic for the same data.
Jilles
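For what it's worth, unlike the shard count, the replica count can be
changed on a live index (the index name here is hypothetical):

# Sketch: raise the replica count on an existing index.
# The index name is hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.put_settings(
    index="events-2014.12",
    body={"index": {"number_of_replicas": 2}},
)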