Index design for web activity

Hey guys,
I'd like your suggestions on index design for web activity data.
Let's say I have browse data, online purchase data, and store purchase
data, and I need to keep a year of each.
A year of browse data is around 80G, online purchase data is around 50G,
and offline (store purchase) data is around 1T.

I need to run queries like: find all the customers who browsed item A in
the past X months and also purchased item B online in the past Y months.
Originally I used a complicated parent/child structure, which sometimes
results in very bad performance. I store all browse, online purchase, and
store purchase data in one index spread across 7 shards.

I have 7 machines, each with 128G of RAM and a 1T hard disk.

Now I am trying to store each type of data in its own index, say
browse_v1, onlinepurchase_v1, storepurchase_v1. Since it's time-based data,
should I split the indexes monthly, or simply yearly? For browse (70G) and
online purchase (50G), I think I can just use one index with one shard
each. Or should I split them into monthly indexes instead? Monthly indexes
give me the flexibility to add and remove data, but they would also
decrease query performance, right? (A search against 1 index becomes a
search against 12 indexes.)
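
On the 1-versus-12 concern: a query for "the past X months" only needs to name the monthly indexes that overlap that window. A minimal sketch of computing those names, assuming a hypothetical naming scheme `<prefix>-YYYY.MM` (the scheme and the function name are illustrative, not from this thread):

```python
from datetime import date
from typing import List


def monthly_indexes(prefix: str, today: date, months_back: int) -> List[str]:
    """Names of the monthly indexes that overlap the last `months_back` months.

    A window reaching X months back from mid-month touches X + 1 calendar
    months, so the current month plus X earlier ones are returned, newest first.
    """
    names = []
    year, month = today.year, today.month
    for _ in range(months_back + 1):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month -= 1
        if month == 0:  # step back across the year boundary
            year, month = year - 1, 12
    return names


print(monthly_indexes("browse_v1", date(2014, 12, 20), 3))
# ['browse_v1-2014.12', 'browse_v1-2014.11', 'browse_v1-2014.10', 'browse_v1-2014.09']
```

With this, a 3-month query would be sent to 4 indexes rather than 12; whether that is fast enough still has to be measured on the real cluster.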

For store data (1T), apparently I have to split it into at least monthly
indexes, but each monthly index would still contain around 100G of data.
With my current cluster, how many shards should I allocate to each monthly
index? I am also concerned about query performance.
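
The shard-count question reduces to simple arithmetic once a target shard size is chosen. A rough sketch, where the target sizes (25G and 50G) are purely hypothetical placeholders to be replaced by whatever testing on the real data supports:

```python
import math


def shards_for_index(index_size_gb: float, target_shard_gb: float) -> int:
    """Smallest shard count that keeps each shard at or under the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


monthly_store_gb = 100  # the per-month figure quoted above for store data
for target_gb in (25, 50):  # hypothetical targets, not recommendations
    n = shards_for_index(monthly_store_gb, target_gb)
    print(f"target {target_gb}G per shard: {n} primary shards per monthly index, "
          f"{n * 12} for the year")
```

The yearly total matters as well as the per-index count, since every monthly index adds its own shards to the same 7 machines.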

Since I am now storing them in separate indexes, I will need to do an
application-level join to get the query I want. Is this the common
way to handle such a use case?
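
For concreteness, the application-level join described here amounts to running one search per index and intersecting the customer IDs on the client. A minimal sketch, assuming each search has already been reduced to a list of customer IDs (the function name and the sample IDs are made up for illustration):

```python
from typing import Iterable, Set


def join_customers(browsers: Iterable[str], purchasers: Iterable[str]) -> Set[str]:
    """Customers that appear in both result sets (client-side intersection)."""
    return set(browsers) & set(purchasers)


# Hypothetical results of two separate searches:
browsed_a = ["c1", "c2", "c3", "c2"]  # customers who browsed item A
bought_b = ["c2", "c3", "c9"]         # customers who purchased item B online

print(sorted(join_customers(browsed_a, bought_b)))  # ['c2', 'c3']
```

The cost is that both full ID lists have to be pulled out of the cluster before they can be intersected, which gets expensive when either side matches many customers.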

I know I should do some testing first, but I hope someone has similar
experience with this and can provide some guidance.

Thanks in advance,
Chen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2cba8839-2577-4fd7-b1e9-550ae579bb1a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On Sat, Dec 20, 2014 at 12:53 AM, Chen Wang <chen.apache.solr@gmail.com>
wrote:

> Hey guys,
> I'd like your suggestions on index design for web activity data.
> Let's say I have browse data, online purchase data, and store purchase
> data, and I need to keep a year of each.
> A year of browse data is around 80G, online purchase data is around 50G,
> and offline (store purchase) data is around 1T.
>
> I need to run queries like: find all the customers who browsed item A in
> the past X months and also purchased item B online in the past Y months.
> Originally I used a complicated parent/child structure, which sometimes
> results in very bad performance. I store all browse, online purchase, and
> store purchase data in one index spread across 7 shards.

Parent/child is indeed slow. Can you somehow denormalize your data to make
queries faster?

> I have 7 machines, each with 128G of RAM and a 1T hard disk.
>
> Now I am trying to store each type of data in its own index, say
> browse_v1, onlinepurchase_v1, storepurchase_v1. Since it's time-based
> data, should I split the indexes monthly, or simply yearly? For browse
> (70G) and online purchase (50G), I think I can just use one index with
> one shard each. Or should I split them into monthly indexes instead?
> Monthly indexes give me the flexibility to add and remove data, but they
> would also decrease query performance, right? (A search against 1 index
> becomes a search against 12 indexes.)
>
> For store data (1T), apparently I have to split it into at least monthly
> indexes, but each monthly index would still contain around 100G of data.
> With my current cluster, how many shards should I allocate to each
> monthly index? I am also concerned about query performance.
>
> Since I am now storing them in separate indexes, I will need to do an
> application-level join to get the query I want. Is this the common
> way to handle such a use case?

As much as possible, you should try to design your documents in such a way
that you don't need to perform joins at search time. Would it be possible
for you to adopt a more "entity-centric" approach at indexing time?

> I know I should do some testing first, but I hope someone has similar
> experience with this and can provide some guidance.

The Elasticsearch book has a chapter about "designing for scale" that gives
good advice on modeling the data and choosing the right shard size and
number of shards.

--
Adrien Grand

Adrien,
Is there a clearer version of the video recording? I can barely see the
slides, and I don't quite get the idea of entity-centric.
For my use case, does it mean maintaining a single document per user that
contains the lists of activities and, at index time, simply updating
those lists?
Something like:
{
  "_source": {
    "customer_id": 123,
    "browse": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}],
    "purchase": [{"item": "item1", "time": "time1"}, {"item": "item2", "time": "time2"}]
  }
}

At index time, I would just update the browse/purchase lists, and then my
query basically becomes flat.
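
To make the idea concrete, here is a small in-memory sketch of that per-customer document: one function appends an activity to the right list, and another answers "browsed A since X and purchased B since Y" from the single document. The key names `item` and `time`, the function names, and the sample dates are assumptions for illustration only; this models the document shape in plain Python and does not call Elasticsearch.

```python
from datetime import datetime, timedelta


def record_activity(doc: dict, kind: str, item: str, when: datetime) -> None:
    """Append one activity to the customer's `kind` list ("browse" or "purchase")."""
    doc.setdefault(kind, []).append({"item": item, "time": when})


def did(doc: dict, kind: str, item: str, since: datetime) -> bool:
    """True if a single entry in the list has this item AND a time at or after `since`."""
    return any(e["item"] == item and e["time"] >= since for e in doc.get(kind, []))


def matches(doc, browsed, browsed_since, purchased, purchased_since) -> bool:
    """The flat query: both conditions are answered from one customer document."""
    return (did(doc, "browse", browsed, browsed_since)
            and did(doc, "purchase", purchased, purchased_since))


now = datetime(2014, 12, 22)
customer = {"customer_id": 123}
record_activity(customer, "browse", "A", datetime(2014, 11, 3))
record_activity(customer, "purchase", "B", datetime(2014, 6, 1))

print(matches(customer, "A", now - timedelta(days=90), "B", now - timedelta(days=365)))  # True
print(matches(customer, "A", now - timedelta(days=90), "B", now - timedelta(days=30)))   # False
```

One caveat worth checking for your version: as far as I know, Elasticsearch flattens an array of plain objects at index time, so the item and time conditions could match two different entries. Keeping each item paired with its own time, as `did` does here, would require mapping the lists as `nested`.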

Is my understanding correct?
Chen
