Use case for Elasticsearch

I need to create a system where the user creates dynamic filters based on our customers' attributes. There are roughly 30 possible filters and 30 million customers, but the number of customers increases every day and attribute values can change every day too, so we have inserts and updates in this data set every day. Another thing is that I can create a new filter or remove one.

The customer data that I will filter is processed from other databases, so this data set is not the original: every day the data will be processed elsewhere and I'll load the processed result, with all attribute values, into Elasticsearch. If something about a customer changed I'll update that customer's attributes, or if it's a new customer, insert a new document. This process will run every day and will need to update or create thousands of customers.
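A minimal sketch of what that daily load could look like, assuming the official elasticsearch-py client: each processed row becomes an update-or-insert ("upsert") action keyed by a stable customer id, so reloading the same data is idempotent. The index name `customers` and the field names are assumptions, not from the original thread; the resulting actions would be fed to `elasticsearch.helpers.bulk(es, actions)`.

```python
def make_upsert_actions(index, customers):
    """Turn processed customer rows into bulk update-or-insert actions."""
    actions = []
    for customer in customers:
        actions.append({
            "_op_type": "update",            # update the doc if it exists...
            "_index": index,
            "_id": customer["customer_id"],  # stable id -> idempotent daily reloads
            "doc": customer,
            "doc_as_upsert": True,           # ...otherwise insert it as-is
        })
    return actions

actions = make_upsert_actions("customers", [
    {"customer_id": "42", "city": "Seattle", "credit_card_limit": 5000},
])
```

Using `doc_as_upsert` means the daily job doesn't need to know whether a customer is new or existing; Elasticsearch decides per document.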

Is this use case a good fit for Elasticsearch? I can't create an index based on date or anything like that, so in this case I will have one index with all my customers.

I need to return a count of customers that match these filters in at most 1-2 seconds. In the future I'll also have to export all customer ids that match the filters, and I have to keep all my customers, so the retention period is forever.

Some attributes:

  • Downloaded the app (boolean)
  • Credit card limit (number)
  • Last transaction (date)
  • Status (text)
  • Last access (date)
  • How many times used the credit card (number)
  • City (text)
  • Average transaction value (number)

The user can use >, <, =, >=, <= to filter, or use IN, like city IN ('New York', 'Seattle').
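Those operators map naturally onto Elasticsearch query clauses: the comparison operators become `range` clauses, `=` becomes `term`, and `IN` becomes `terms`. A hedged sketch of that translation, assuming filters arrive as `(field, operator, value)` tuples (the field names below are from the attribute list above, everything else is illustrative); the resulting dict is what you would pass as the query of a `_count` or `_search` request. Putting the clauses in `bool`/`filter` context skips scoring and lets Elasticsearch cache them, which helps with the 1-2 second count requirement.

```python
# Map comparison operators to Elasticsearch range-query keys.
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}

def build_query(filters):
    """filters: list of (field, op, value), e.g. ("city", "IN", [...])."""
    clauses = []
    for field, op, value in filters:
        if op in RANGE_OPS:
            clauses.append({"range": {field: {RANGE_OPS[op]: value}}})
        elif op == "=":
            clauses.append({"term": {field: value}})
        elif op == "IN":
            clauses.append({"terms": {field: value}})
        else:
            raise ValueError(f"unsupported operator: {op}")
    # filter context: no scoring, cacheable -- ideal for counts
    return {"bool": {"filter": clauses}}

query = build_query([
    ("credit_card_limit", ">=", 1000),
    ("city", "IN", ["New York", "Seattle"]),
])
```

With elasticsearch-py, the count itself would then be something like `es.count(index="customers", query=query)`.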


That sounds like a good fit. An index can have many shards, and each shard can hold up to 2 billion documents, so storing your data in a single index should be fine.


Welcome Mac,

this can be a good fit.
I have 120 million documents in an index with customer id as the primary key (in RDBMS terms),
meaning each record is written/updated/deleted by that customer id,
one record per customer_id.

I think you have pretty much the same concept.
I think in your case you should create the document_id from a combination of fields in the record set; this way, if anything gets updated in the primary database, you can replicate that in ELK.
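One hedged way to sketch that "combination of fields" idea: hash the identifying fields into a deterministic `_id`, so the same record in the primary database always maps to the same Elasticsearch document and an update overwrites rather than duplicates. Which fields to combine is an assumption here; any set that uniquely identifies a customer works.

```python
import hashlib

def make_doc_id(*fields):
    """Derive a stable _id from identifying fields; same input -> same _id."""
    key = "|".join(str(f) for f in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

For a single-key case like one record per customer_id, using the customer id directly as `_id` is simpler; the hash is only needed when the identity spans several fields.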

I am also new to ELK, but the more I work with it, the more I learn what it is capable of. A LOT of things.


Thanks @Christian_Dahlqvist !

Thanks @elasticforme! Cool! Can you tell me the cluster hardware you use for this index with 120 million documents? Thanks!

I have five data nodes, three masters, and one Logstash node.

But this cluster doesn't hold only this index; it has many more. In fact, it has more than 600 million total documents.

All five data nodes have 2x1TB SSDs for storage. Each disk is set up as an individual disk, no RAID.
The reason we don't need RAID is that each index has one replica, so if one disk fails you will not lose data.

