URGENT: PLEASE HELP Unique list of bigram and trigram combinations then aggregate stats

I'm currently solving this poorly, not leveraging ES. We recently scaled our data and my system crashes. I need an efficient solution to get back online. Thank you!

I have a list of 300,000 unique strings, each with a different number of words. Each string has statistics associated with it.

I want to run ngram, bigram, and trigram extraction and aggregate the statistics by each unique combination (regardless of order).

The order of the words does not matter. So if I have one line for "dog food for sale" with cost = $10 and another string for "for sale cat food" with a cost of $5, then I will have one item in my table for "sale food" and it will show up as $15.

I want the unique combinations, then an aggregate where the string contains the single word (for ngram), contains both words in any order (for bigram), or contains all three words in any order (for trigram). So something like SUM(cost) where words like "sale" (ngram), SUM(cost) where words like "sale" and words like "dog" (bigram), etc.
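For reference, the aggregation described above can be sketched outside Elasticsearch in a few lines of Python. This is only a minimal sketch under assumptions: the function name `aggregate_by_word_combos`, the whitespace tokenization, the lowercasing, and the two sample rows are all illustrative and not taken from the real data.

```python
from collections import defaultdict
from itertools import combinations


def aggregate_by_word_combos(rows, max_size=3):
    """Sum cost for every unordered combination of 1..max_size distinct words."""
    totals = defaultdict(float)
    for text, cost in rows:
        # Sorting the distinct words gives each combination one canonical form,
        # so "sale food" and "food sale" land on the same key.
        words = sorted(set(text.lower().split()))
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                totals[combo] += cost
    return dict(totals)


# Illustrative rows only, not the poster's data.
rows = [("dog food for sale", 10), ("cat food for sale", 5)]
totals = aggregate_by_word_combos(rows)
print(totals[("food", "sale")])  # 15.0
print(totals[("dog",)])          # 10.0
```

Because each string contributes to a key only once, the result matches "sum of cost where the string contains all of these words", whatever order the words appear in.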

The raw data is imported directly into ES, and the final output I want is one table where column one has all the unique combinations and column two has the sum of the cost for each respective combination.

What's the best way to do this? Thank you :blush:

It is not very clear to me what your ES documents are. Are you generating ngrams, bigrams, and trigrams yourself and indexing them as documents? Or do you want ES to extract them? If you want ES to do this, ES has a Shingle Token Filter that can generate word ngrams, but the order is predefined.
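As a rough illustration of the shingle option, index settings along these lines would produce word shingles of two and three words alongside the single words. This is a sketch, not a tested configuration: the names `word_shingles` and `shingle_analyzer` are made up for the example, and the settings are written as a Python dict that would be serialized to JSON for the create-index request.

```python
import json

# Hypothetical filter and analyzer names. The "shingle" filter type and its
# min_shingle_size / max_shingle_size / output_unigrams parameters are the
# documented ones.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_shingles"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Shingles are built from adjacent tokens in their original order, so "dog food online" would yield "dog food" and "food online" but never "dog online".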

It is also not very clear what your ES queries are. What do you intend to search for? Or aggregate?

Thank you Maya,

I want to break all the words out into ngrams, bigrams, and trigrams; then all those unique items would be in one column, like:
dog food online 5
cat food online 5

food 10
online 10
dog 5
cat 5
dog food 5
cat food 5
food online 10
dog food online 5
cat food online 5
(I know it will generate more, but I'll keep it simple)
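To get a feel for how much more it generates, the number of one-, two-, and three-word combinations per string can be counted directly. A small sketch, assuming order is ignored and repeated words within a string count once; the helper name `combo_count` is made up for illustration.

```python
from math import comb


def combo_count(distinct_words, max_size=3):
    """Number of unordered combinations of 1..max_size words from one string."""
    return sum(comb(distinct_words, size) for size in range(1, max_size + 1))


for k in (3, 5, 10):
    print(k, combo_count(k))
# 3 7
# 5 25
# 10 175
```

So each three-word string produces 7 combinations, and the count grows roughly with the cube of the number of words, which may matter at 300,000 strings.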

I don't care about the order when aggregating. So SUM(where the string contains all the words, in any order).

Can we do all this in Elastic? Or do I need to do the ngram, bigram, and trigram in Elastic, export, then aggregate? This is what I'm doing now, but when I try to put the aggregated data back into Elastic it crashes.

Is this possible?

I don't think Elasticsearch is designed for your task.
What aggregation are you using?

Do you offer paid consulting? I'd love to share my screen and get a final answer :slight_smile:

So I'm looking to create a unique list of ngrams, which is straightforward.

But then I need to generate a list of unique combinations of two words and three words. Because I don't care about the order, I don't think this is bigram and trigram.
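The difference can be shown with a small Python sketch: classic bigrams are adjacent word pairs in their original order, while the pairs wanted here are all two-word combinations with order ignored. The function names and whitespace tokenization are assumptions made for illustration.

```python
from itertools import combinations


def contiguous_bigrams(text):
    """Adjacent word pairs in their original order (classic bigrams)."""
    words = text.split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]


def unordered_pairs(text):
    """Every pair of distinct words, order ignored."""
    return list(combinations(sorted(set(text.split())), 2))


text = "dog food online"
print(contiguous_bigrams(text))  # [('dog', 'food'), ('food', 'online')]
print(unordered_pairs(text))     # [('dog', 'food'), ('dog', 'online'), ('food', 'online')]
```

The pair ("dog", "online") only appears in the second list, since those two words are not adjacent in the string.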
