We run something like a typical e-commerce platform with custom product search based on Elasticsearch (a service which simply sends queries to Elasticsearch). Our product search service has worked flawlessly, but recently we needed some new features, namely per-user pricing and per-user product limits. So far we haven't stored any per-user data, so this is a huge problem from our perspective, and we will probably have to redesign the entire search service. In addition, the products in the index are updated periodically, for example every 30 minutes.
The platform is used by about 10k users, and there are about 5k products in the index. I would like to ask: what is the typical solution to this type of problem? The first thing that comes to mind is a separate document in Elasticsearch for each user and product combination, but then you would end up with 5k * 10k separate documents, which seems like too much. Or should we use the parent/child feature in Elasticsearch to reduce duplicated data, roughly as sketched below? What are your tips and experiences?
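To make the parent/child option concrete, this is roughly what I have in mind: products as parent documents and the per-user terms as children via a join field. Index name, relation names and fields are just for illustration.

```python
# Rough sketch of the parent/child (join field) option mentioned above.
# Index name, relation names and field names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

es.indices.create(
    index="products-per-user",
    mappings={
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"product": "user_terms"},  # parent -> child
            },
            "name": {"type": "text"},
            "user_id": {"type": "keyword"},
            "price": {"type": "scaled_float", "scaling_factor": 100},
        }
    },
)

# Parent: the shared product data, indexed once per product.
es.index(index="products-per-user", id="p1",
         document={"name": "Widget", "relation": "product"})

# Child: the per-user terms, routed to the parent's shard.
es.index(index="products-per-user", id="p1-u1", routing="p1",
         document={"user_id": "u1", "price": 8.99,
                   "relation": {"name": "user_terms", "parent": "p1"}})
```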
If I calculate correctly, that would be 50 million documents, which is not a lot at all in Elasticsearch terms. A single shard in Elasticsearch has a limit of around 2 billion documents, and although filling a shard that far is rarely optimal, it shows that 50 million is not that much and would likely fit fine in a single shard. For larger data volumes the index can be given multiple primary shards, so an index can hold an even larger number of documents.
Use a flat and denormalised data structure. It makes queries simpler and faster and scales well.
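As a minimal sketch of what that flat, denormalised model could look like: one document per (user, product) pair, indexed with the bulk helper. The index name, field names and pricing helper are assumptions, not a prescription.

```python
# Minimal sketch of the flat, denormalised model: one document per
# (user, product) pair. Index name, field names and pricing logic are
# stand-ins; replace them with your own data sources.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

users = [{"id": "u1", "discount": 0.10}, {"id": "u2", "discount": 0.0}]
products = [{"id": "p1", "name": "Widget", "base_price": 9.99, "stock": 42}]

def price_for(user, product):
    """Hypothetical per-user pricing rule."""
    return round(product["base_price"] * (1 - user["discount"]), 2)

def user_product_docs():
    """Yield one denormalised document per user/product combination."""
    for user in users:
        for product in products:
            yield {
                "_index": "user-products",
                "_id": f"{user['id']}-{product['id']}",
                "_source": {
                    "user_id": user["id"],
                    "product_id": product["id"],
                    "name": product["name"],
                    "price": price_for(user, product),
                    "stock": product["stock"],
                },
            }

# The bulk helper batches the index requests instead of sending them
# one document at a time.
helpers.bulk(es, user_product_docs())
```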
@Christian_Dahlqvist, thank you for your response.
I estimate that an index containing 50 million documents would be around 25 GB. The biggest problem seems to be that updating such a large amount of data would probably take many hours. I can imagine a situation where we change the name of a product, which would mean changing 10k documents. We can also imagine a situation where a product is sold out and its stock drops to zero; here we would also have to update 10k documents, and users would only see that after, for example, 6 hours, which is too long. We also need to keep stock information in the index, for example so that it can be filtered on. We would probably have to spread the index across several primary shards. So far we have had 4 nodes, but all of them were used for redundancy.
It would be interesting to see how this is implemented in other projects. It seems that many online stores have a dedicated price for each user. I hope this is not closely guarded knowledge and that someone can share their experience.
Someone might ask why we want to keep the price information in the index at all: because users need to be able to sort and filter by it, for example.
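The kind of query we need would look roughly like this with the flat model (index and field names follow the earlier sketch and are only assumptions): filter on the user, then filter and sort on that user's price.

```python
# Rough sketch of a per-user search against the flat model: filter on
# user_id, range-filter on the per-user price and stock, sort by price.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="user-products",
    query={
        "bool": {
            "must": [{"match": {"name": "widget"}}],
            "filter": [
                {"term": {"user_id": "u1"}},          # only this user's documents
                {"range": {"price": {"lte": 20.0}}},  # per-user price filter
                {"range": {"stock": {"gt": 0}}},      # only in-stock products
            ],
        }
    },
    sort=[{"price": {"order": "asc"}}],
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"]["product_id"], hit["_source"]["price"])
```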
What is the average size of your documents? What are you basing the update duration estimate on? If you did a test, did you make sure you used the bulk API?
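Partial updates through the bulk helper would look roughly like this (index and field names assumed to match the flat model discussed above); only the changed fields are sent for each affected per-user copy of the product.

```python
# Sketch of partial updates via the bulk helper: when a product's stock
# changes, send one small "doc" update per per-user copy rather than
# reindexing full documents. Index/field names are assumptions.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def stock_updates(product_id, new_stock, user_ids):
    """Yield one partial-update action per affected per-user document."""
    for user_id in user_ids:
        yield {
            "_op_type": "update",
            "_index": "user-products",
            "_id": f"{user_id}-{product_id}",
            "doc": {"stock": new_stock},
        }

user_ids = [f"u{i}" for i in range(10_000)]  # stand-in for the ~10k users
helpers.bulk(es, stock_updates("p1", 0, user_ids))
```

An update_by_query on product_id would be another way to touch all copies of one product in a single request; either way the work per document is a small partial update, which is why document size and batching matter for the duration estimate.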