Our 1 million customers produce 1 billion transaction documents every year, but the top 50 customers account for more than half of the data (500 million docs).
I can't know these top customers in advance, because the top 50 can change every year.
The problem is that every customer wants to be able to search their own data from the last 3 years.
I split the data by time (one index per season), but search performance is a problem. Is there any good advice on index splitting for this scenario? How should I split the indices? Since the data is extremely imbalanced by customer ID, it seems index_route = customer_id mod indices_number is not a good idea.
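To make the concern about modulo routing concrete, here is a minimal sketch. The customer IDs and counts are made up; they only follow the rough proportions above (about 10 million docs per top customer per year versus about 500 for a typical one, assuming the load is spread evenly within each group). It uses a plain modulo as in the formula above, not Elasticsearch's own routing hash.

```python
def docs_per_bucket(doc_counts, num_buckets):
    """Sum document counts per bucket when bucket = customer_id % num_buckets."""
    buckets = {b: 0 for b in range(num_buckets)}
    for customer_id, count in doc_counts.items():
        buckets[customer_id % num_buckets] += count
    return buckets

# Hypothetical data: customers 3 and 9 are heavy, the rest are typical.
doc_counts = {1: 500, 2: 500, 3: 10_000_000, 4: 500, 5: 500, 9: 10_000_000}

result = docs_per_bucket(doc_counts, num_buckets=6)
for bucket, total in result.items():
    print(bucket, total)
```

In this toy case both heavy customers fall into bucket 3, so that bucket holds almost all of the documents while bucket 0 stays empty. Which buckets get overloaded depends entirely on which IDs happen to be heavy in a given year.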
Each doc has about 35 fields. About 6 of them are "long text" fields (shorter than 50 characters); the others are date, keyword, and double.
The most common query: a customer searches the last year (or the latest 20 months) of their own transaction data on some of their sub-accounts (every customer has 1 to 100 sub-accounts for business), with some matching on the long text fields (no scoring needed, a filter is fine), and wants to get the filtered transaction docs and also a sum over those docs.
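A minimal sketch of what such a request body might look like, written as a Python dict. All field names (customer_id, sub_account_id, transaction_date, description, amount) are hypothetical placeholders, not the real mapping. Every clause sits in the bool query's filter context, so no scores are computed, and a sum aggregation returns the total alongside the hits.

```python
import json

def build_search_body(customer_id, sub_account_ids, text, months_back=20, size=100):
    """Build a filter-only search body with a sum aggregation.

    Field names are placeholders chosen for illustration.
    """
    return {
        "size": size,  # number of matching docs to return
        "query": {
            "bool": {
                "filter": [
                    {"term": {"customer_id": customer_id}},
                    {"terms": {"sub_account_id": sub_account_ids}},
                    # Date math: "M" means months, "/d" rounds to the day.
                    {"range": {"transaction_date": {"gte": f"now-{months_back}M/d"}}},
                    # A match clause in filter context only includes or excludes docs.
                    {"match": {"description": text}},
                ]
            }
        },
        "aggs": {"total_amount": {"sum": {"field": "amount"}}},
    }

body = build_search_body(42, ["A-1", "A-2"], "invoice")
print(json.dumps(body, indent=2))
```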
The cluster has 2 client nodes, 6 data nodes with 1 TB of storage each, and 3 master nodes. All nodes have an 8 GB heap (this can be raised to 32 GB max if needed). The number of primary shards can only be 6 (the same as the number of data nodes), as suggested by the DBA. The whole cluster is used only for this transaction data. It now holds three years of data in 13 indices (one per season), and the largest shard in the latest season's index is about 150 GB.
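As a rough back-of-the-envelope check on search fan-out with this layout, the sketch below counts how many primary shards a time-windowed search touches. It assumes one season is 3 months, that no custom routing is used (so a search goes to one copy of every shard in each index it targets), and that the search targets only the indices overlapping the time window.

```python
import math

PRIMARY_SHARDS_PER_INDEX = 6  # from the cluster description
INDEX_COUNT = 13              # from the cluster description
MONTHS_PER_INDEX = 3          # assumption: one season = 3 months

def min_indices_touched(window_months, months_per_index=MONTHS_PER_INDEX):
    """Fewest indices a window can overlap (window aligned to index boundaries)."""
    return math.ceil(window_months / months_per_index)

def min_shards_touched(window_months):
    """Fewest shards searched for a window, one copy of each shard per index."""
    return min_indices_touched(window_months) * PRIMARY_SHARDS_PER_INDEX

total_primaries = INDEX_COUNT * PRIMARY_SHARDS_PER_INDEX
print(total_primaries, min_shards_touched(12), min_shards_touched(20))
```

A window that does not line up with index boundaries overlaps one more index, so a 20-month search touches 7 or 8 indices, and every one of those shards is searched even though a single customer's documents are a small part of each.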
Thanks a lot.