Index splitting for imbalanced user data

Our 1 million customers produce about 1 billion transaction documents every year, but the top 50 customers account for more than half of that data (500 million docs), and I can't know who these top customers are in advance, because every year there may be a new top 50.

The problem is that all customers want to be able to search their own data from the last 3 years. I currently split the data by time (one index per quarter), but I have search performance issues. Any good advice on index splitting for this scenario? How should I split the indices? Since the data volume per customer ID is extremely imbalanced, it seems `index_route = customer_id mod indices_number` is not a good idea.

Each document has about 35 fields: about 6 are full-text fields (each shorter than 50 characters), and the rest are date, keyword, and double fields.

The most common query: a customer wants to search his own transaction data from the last year (or latest 20 months) on some of his sub-accounts (every customer has 1 to 100 sub-accounts for business), with some match conditions on the text fields (no scoring needed, filtering is fine), and wants to get both the filtered transaction docs and a sum over those docs.
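For reference, a query like that could look roughly as follows. The index name and field names (`customer_id`, `sub_account`, `created_at`, `description`, `amount`) are made up for illustration. Everything sits in filter context so no scores are computed, and a `sum` aggregation returns the total alongside the matching docs:

```
GET transactions-2023-q4/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term":  { "customer_id": "C12345" } },
        { "terms": { "sub_account": ["A1", "A7"] } },
        { "range": { "created_at": { "gte": "now-12M/d" } } },
        { "match": { "description": "refund" } }
      ]
    }
  },
  "aggs": {
    "total_amount": { "sum": { "field": "amount" } }
  }
}
```

Because the `match` clause is in the `filter` array of the `bool` query, it only filters and is cacheable; no relevance scoring is done.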

The cluster is 2 client nodes, 6 data nodes with 1TB of storage each, and 3 master nodes. All nodes have an 8GB heap (can be raised to 32GB max if needed). The number of primary shards can only be 6 (matching the data nodes), as suggested by our DBA. The whole cluster is dedicated to this transaction data. It currently holds three years of data in 13 indices (one per quarter), and the largest shard in the latest quarterly index is about 150GB.

thanks a lot

How large and complex are your documents? What type of queries do you run against this data? How large is the cluster? What is the hardware specification? How many indices and shards do you have per season? What is the average shard size? How many queries do you serve per time period?

Each document has about 35 fields: about 6 are full-text fields (each shorter than 50 characters), and the rest are date, keyword, and double fields. The most common query is a customer searching his own transaction data from the last year (or latest 20 months) on some of his sub-accounts (every customer has 1 to 100 sub-accounts for business), with some match conditions on the text fields (no scoring needed, filtering is fine), returning both the filtered transaction docs and a sum over those docs.
The cluster is 2 client nodes, 6 data nodes with 1TB of storage each, and 3 master nodes. All nodes have an 8GB heap (can be raised to 32GB max if needed). The number of primary shards can only be 6 (matching the data nodes), as suggested by our DBA. The whole cluster is dedicated to this transaction data. It currently holds three years of data in 13 indices (one per quarter), and the largest shard in the latest quarterly index is about 150GB.

How many indices do you have in the cluster? What type of storage do you have?

Thanks, Christian. Assume I now have a new cluster dedicated to this problem: 2 client nodes, 3 master nodes, and 12 data nodes, each data node with 1TB of HDD storage, and every node with 32GB of memory.

The problem is that I need to store 3 years of transaction data produced by 1 million customers, and the data is extremely imbalanced: every year about half of the data comes from the top 50 customers (and I cannot predict which customers will be the top ones next year). Customers want to search their own orders from the latest 12 or 30 months. So how should I create the indices? First split by year, and then split by customer ID hash? And what about the top customers? Their data for a single year is extremely large for one index.

Before diving into suggestions on how to shard and organize this data I would recommend monitoring what disk I/O and iowait looks like on the nodes as this often is the bottleneck for indexing as well as query performance. Depending on how much querying you expect, the disks you have may provide enough I/O. If they become the bottleneck I would recommend upgrading to SSDs. You should also monitor heap usage and increase this if you see any evidence of heap pressure and/or frequent or slow GC.
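As a starting point for that monitoring (a sketch, not a full setup): on Linux, `iostat -x 1` on each data node shows per-device utilization and `%iowait`, and Elasticsearch's node stats API exposes heap usage, GC timing, and filesystem stats:

```
GET _nodes/stats/jvm,fs?human
```

Sustained high `%iowait` during queries, or long/frequent GC pauses in the `jvm` section, would point at disk and heap respectively as the thing to fix first.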

Given the number of customers you have I would recommend using document routing on customer id if you are not already. This means that only one shard per time period will be searched at query time, reducing the amount of disk I/O. Given that you have a number of customers that are much larger than others your shard sizes will not be completely even in size, but I still think it's worth using. Note that you need to set the routing key both at index and query time.
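A minimal sketch of what that looks like, with a hypothetical index name, document id, and `customer_id` value; the same `routing` value must be used when indexing and when searching:

```
PUT transactions-2023-q4/_doc/tx-001?routing=C12345
{
  "customer_id": "C12345",
  "sub_account": "A1",
  "created_at": "2023-11-02",
  "amount": 99.5
}

GET transactions-2023-q4/_search?routing=C12345
{
  "query": {
    "bool": {
      "filter": [ { "term": { "customer_id": "C12345" } } ]
    }
  }
}
```

Note that the `term` filter on `customer_id` is still required: routing only narrows the search to one shard, and that shard holds many other customers' documents as well.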

There is nothing that stipulates that the number of primary shards has to equal the number of data nodes you have, so here I would recommend deviating from this. As you only search 1 shard per index when using routing, the data volume that needs to be searched shrinks as the number of primary shards per index increases. If you go with 12 primary shards per index and also switch to monthly indices, the maximum shard size will shrink and you will on average search 1/12 of the data in each index. This would give you 864 shards (assuming 1 replica) to cover a 3 year period, which is 72 shards per data node, which is quite reasonable.
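A sketch of that setup as an index template, assuming monthly index names like `transactions-2024-01` (the template name and pattern are made up; on older Elasticsearch versions the legacy `_template` API is the equivalent of this composable-template call):

```
PUT _index_template/transactions
{
  "index_patterns": ["transactions-*"],
  "template": {
    "settings": {
      "number_of_shards": 12,
      "number_of_replicas": 1
    }
  }
}
```

With this in place, every new monthly index picks up 12 primaries automatically, so no per-index settings are needed when a new month starts.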

Try this and see how it works for you.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.