Let's say I have 50 million documents and I need to group matching documents together, i.e., form multiple groups in which all documents are nearly identical. Before doing the actual matching, I partitioned these 50 million documents by an id, in such a way that a document in one group can never match a document in another group. This gives around 6 million groups. I can't simply take a group and compare each document against every other document in that group, because that would be too slow. So I decided to use Elasticsearch. Now I am not sure which design is right for my problem. I have tried the following approaches:
a) Number of nodes - 6, number of shards - 10, number of replicas - 0, number of indices - 1
I processed 30 groups in parallel. While processing a group, I take each document and search for it in the ES index. If it matches any other document in the index, I don't index it; otherwise I index it. Since 30 groups are processed in parallel, reads and writes happen in parallel, but there are no bulk writes. The entire process took 3-4 days to complete.
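The per-group loop in approach (a) looks roughly like the sketch below. The names `dedupe_group`, `find_match`, and `index_doc` are hypothetical, and the token-overlap check and in-memory list are toy stand-ins for the real Elasticsearch query and index, which are not shown here.

```python
from typing import Callable, Iterable, List


def dedupe_group(
    docs: Iterable[str],
    find_match: Callable[[str], bool],
    index_doc: Callable[[str], None],
) -> int:
    """Search for each document first; index it only when nothing matches.

    Returns the number of documents that were indexed.
    """
    indexed = 0
    for doc in docs:
        if not find_match(doc):   # one search round trip per document
            index_doc(doc)        # one single-document write, no bulk
            indexed += 1
    return indexed


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token overlap between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


# In-memory stand-in for the index (assumption: the real calls go to Elasticsearch).
store: List[str] = []
group = ["acme corp new york", "acme corp new york ny", "globex inc boston"]
kept = dedupe_group(
    group,
    find_match=lambda d: any(jaccard(d, s) >= 0.75 for s in store),
    index_doc=store.append,
)
```

The point of the sketch is the cost shape: every document pays for one search and, if unmatched, one separate write.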
b) Number of nodes - 6, number of shards - 1, number of replicas - 0, number of indices - one index per group
While processing a group, I create an index on the fly, take each document, and search for it in the newly created index. If it matches any other document in the index, I don't index it; otherwise I index it. When the group is done, I delete the index. Performance was better, but not by much. So I switched to bulk writes while processing a group. That helped somewhat, but my cluster status frequently turned yellow and I also saw heavy load on the ES machine.
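The bulk write in approach (b) sends a group's unmatched documents in one `_bulk` request instead of one request per document. Below is a minimal sketch of building that request body. The index name `group-42` and the `name` field are hypothetical, and actually sending the body (a POST to `/_bulk` with content type `application/x-ndjson`) is left out.

```python
import json
from typing import Dict, Iterable, List


def bulk_index_body(index_name: str, docs: Iterable[Dict]) -> str:
    """Build an NDJSON body for the _bulk API: an action line, then the source."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_index_body("group-42", [{"name": "acme corp"}, {"name": "globex inc"}])
```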
c) Number of nodes - 6, number of shards - 6, number of replicas - 0, number of indices - 1
I used the "users" data flow model described in https://www.elastic.co/videos/big-data-search-and-analytics. I created one alias per group, so around 5 million aliases in total. After processing 10 million documents, I see a drop in performance: for the first 10 million documents it took 3-4 minutes to process 30,000 documents (write and read), while later the same number took 7 minutes.
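For approach (c), each group gets a filtered alias with routing on the single shared index. Below is a rough sketch of the request body for the `_aliases` API. The index name `documents`, the `group_id` field, and the `group-<id>` alias naming are assumptions for illustration, not my exact setup.

```python
import json
from typing import Dict, List


def group_alias_action(index_name: str, group_id: str) -> Dict:
    """One 'add' action for the _aliases API: a filtered alias with routing."""
    return {
        "add": {
            "index": index_name,
            "alias": f"group-{group_id}",
            "filter": {"term": {"group_id": group_id}},
            "routing": group_id,
        }
    }


def aliases_body(index_name: str, group_ids: List[str]) -> str:
    """Serialize several alias actions into one request body."""
    actions = [group_alias_action(index_name, g) for g in group_ids]
    return json.dumps({"actions": actions})


body = aliases_body("documents", ["1001", "1002"])
```

The routing value keeps all documents of a group on one shard, so a search through the alias only touches that shard, while the filter hides other groups that share it.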
I don't really want these documents to stay in the cluster after the process finishes (that's why I called that out in the title). I am new to Elasticsearch. Is there a better way to address my problem?