I'm currently solving this poorly, without leveraging Elasticsearch (ES). We recently scaled up our data and my system now crashes. I need an efficient solution to get back online. Thank you!
I have a list of 300,000 unique strings with varying numbers of words in each. Each string has statistics associated with it.
I want to run a unigram (single-word n-gram), bigram, and trigram analysis and aggregate the statistics by each unique combination of words, regardless of order.
The order of the words does not matter. So if I have one string "dog food for sale" with cost = $10 and another string "for sale cat food" with a cost of $5, then I will have one item in my table for "sale food", and it will show up as $15.
I want the unique combinations, and then to aggregate over every string that contains the single word (unigram), contains both words in any order (bigram), or contains all three words in any order (trigram). So something like sum(cost) where words like "sale" (unigram), sum(cost) where words like "sale" and words like "dog" (bigram), etc.
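To make the goal concrete, here is a rough Python sketch of the aggregation I'm after. It only illustrates the desired result and is not an Elasticsearch solution. The function name `aggregate_costs` and the two sample rows are made up for illustration, and I'm assuming whitespace tokenization, lowercase matching, and whole-word matches (not substring matches).

```python
from collections import defaultdict
from itertools import combinations


def aggregate_costs(rows, max_n=3):
    """Sum cost per unordered combination of 1..max_n distinct words.

    rows: iterable of (text, cost) pairs (hypothetical input shape).
    Returns a dict mapping a sorted tuple of words to the summed cost.
    """
    totals = defaultdict(float)
    for text, cost in rows:
        # Deduplicate and sort the words so that word order never matters
        # and a repeated word in one string is counted only once.
        words = sorted(set(text.lower().split()))
        for n in range(1, max_n + 1):
            for combo in combinations(words, n):
                totals[combo] += cost
    return dict(totals)


# Made-up sample data, not my real rows.
rows = [("dog food for sale", 10.0), ("cat food on sale", 5.0)]
totals = aggregate_costs(rows)
print(totals[("food", "sale")])  # 15.0
```

Each key is a sorted tuple, so "sale food" and "food sale" land in the same row of the output table. A string with k distinct words produces on the order of k^3 trigram keys, which I suspect is why doing this in memory falls over at 300,000 strings.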
The raw data is imported directly into ES, and the final output I want is one table where column one has all the unique combinations and column two has the sum of the cost for each respective combination.
What's the best way to do this? Thank you