Phrase Cloud

I've got an index of product reviews in ES 6.3 (can upgrade if necessary).

I'd like to create a 'phrase cloud' of significant phrases found in the reviews, similar to how it's done on Amazon product reviews.

I've started by indexing the 'detail' field of the document with a shingle token filter, and am retrieving the necessary data with a significant text aggregation query.

e.g.

"analysis": {
  "filter": {
    "phrase_shingles_token_filter": {
      "max_shingle_size": "4",
      "output_unigrams": "true",
      "type": "shingle",
      "filler_token": ""
    }
  },
  "analyzer": {
    "phrase_shingles_analyzer": {
      "filter": [
        "standard",
        "lowercase",
        "stop",
        "phrase_shingles_token_filter",
        "trim"
      ],
      "type": "custom",
      "tokenizer": "standard"
    }
  }
},

"mappings": {
  "review": {
    "properties": {
      "createdAt": {
        "type": "date"
      },
      "detail": {
        "type": "text",
        "similarity": "BM25",
        "fields": {
          "shingles": {
            "type": "text",
            "similarity": "BM25",
            "analyzer": "phrase_shingles_analyzer"
          }
        }
      },

GET catalog.review/_search
{
  "size": 0,
  "query": {
    "term": {
      "entityPkValue": {
        "value": 4186
      }
    }
  },
  "aggregations" : {
        "word_cloud" : {
            "sampler" : {
                "shard_size" : 500
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { 
                      "size": 20, 
                      "filter_duplicate_text": true,
                      "field" : "detail.shingles",
                      "exclude" : ["greenwork", "lawn", "mow"]
                    }
                }
            }
        }
    }
}

It kinda works, but there are a few issues with its relevance:

1 - Tokens from the product's name keep appearing in the results. I'd like to be able to retrieve a list of tokens from the product's title and exclude related tokens from the results. With the shingle tokens being whitespace separated words, the 'exclude' on the aggregation doesn't really work. It would be good if I could exclude tokens that are related to the title at index time.

2 - I'm not sure if I can boost longer phrases?

3 - Lots of similar words appear in the results. e.g. I get 'storage', 'storage box', 'storing' as significant phrases. Not sure if I can do much here?

I'm hoping I can retrieve the kind of results I want directly from Elasticsearch without having to process the data elsewhere. Also not sure if I'm on the right track?

Hi Dane,

White space should work in the exclude clause as the provided strings aren't analyzed.

Longer phrases should naturally score higher, I'd expect: longer text runs are rarer and, if repeated, significant. You could try experimenting with some of the different significance heuristics, e.g. percentage, which will favour rarer things.
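As a rough illustration of swapping the heuristic, here is a minimal Python sketch that builds the same sampler + significant_text request with "percentage" in place of the default scoring. The helper name significant_text_agg is made up for this example, and the min_doc_count value is an assumption: percentage tends to reward very rare terms, so some document-count floor is probably worth keeping.

```python
import json


def significant_text_agg(field, size=20, heuristic="percentage", min_doc_count=3):
    """Build a sampler + significant_text aggregation that uses a named heuristic.

    Hypothetical helper; only the request structure comes from Elasticsearch.
    """
    return {
        "word_cloud": {
            "sampler": {"shard_size": 500},
            "aggregations": {
                "keywords": {
                    "significant_text": {
                        "field": field,
                        "size": size,
                        "filter_duplicate_text": True,
                        # Assumed floor so one-off typos (1/1 = 100%) don't dominate.
                        "min_doc_count": min_doc_count,
                        # An empty object selects the heuristic with default options.
                        heuristic: {},
                    }
                }
            },
        }
    }


body = {
    "size": 0,
    "query": {"term": {"entityPkValue": {"value": 4186}}},
    "aggregations": significant_text_agg("detail.shingles"),
}

# Send this as the body of GET catalog.review/_search.
print(json.dumps(body, indent=2))
```

Comparing the top buckets from this request against the default scoring should show fairly quickly whether longer phrases move up or whether noise takes over.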

Rationalizing the results may be possible if you know, for example, that 'storage' is only ever used in the context of 'storage box'. To find out which of the significant text results are used with each other and which are used independently, put them as term filters in the adjacency_matrix aggregation as part of a follow-up search to get a feel for the co-occurrence of the items. This information should help you prune single words that only ever appear as part of phrases.
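A minimal Python sketch of that follow-up step, under a few assumptions: the function names and the "co_occurrence" aggregation name are invented for illustration, each phrase is used directly as its filter name (so phrases must not contain the "&" separator), and the bucket counts in the example are made-up data rather than real results.

```python
def adjacency_matrix_body(product_id, phrases, field="detail.shingles"):
    """Follow-up search body: one named term filter per significant phrase."""
    return {
        "size": 0,
        "query": {"term": {"entityPkValue": {"value": product_id}}},
        "aggregations": {
            "co_occurrence": {
                "adjacency_matrix": {
                    "filters": {p: {"term": {field: p}} for p in phrases}
                }
            }
        },
    }


def redundant_single_words(buckets, separator="&"):
    """Return single words whose every document also contains a longer phrase using them."""
    counts = {b["key"]: b["doc_count"] for b in buckets}
    names = [k for k in counts if separator not in k]
    redundant = []
    for word in names:
        if len(word.split()) != 1:
            continue  # only single words are candidates for pruning
        for phrase in names:
            if phrase == word or word not in phrase.split():
                continue
            # Intersection buckets are keyed "A&B"; empty intersections are omitted.
            both = counts.get(word + separator + phrase,
                              counts.get(phrase + separator + word, 0))
            if both == counts[word]:
                redundant.append(word)
                break
    return redundant


# Made-up buckets in the shape of aggregations.co_occurrence.buckets.
example_buckets = [
    {"key": "storage", "doc_count": 7},
    {"key": "storage box", "doc_count": 7},
    {"key": "storage&storage box", "doc_count": 7},
    {"key": "battery", "doc_count": 19},
    {"key": "battery life", "doc_count": 4},
    {"key": "battery&battery life", "doc_count": 4},
]
print(redundant_single_words(example_buckets))  # ['storage']
```

Here 'storage' is pruned because all 7 of its documents also contain 'storage box', while 'battery' stays because it appears in 15 documents without 'battery life'.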

Hi Mark,

Thanks so much for your help. This should give me a lot to go on while improving the results.

To clarify what I was referring to with the 'exclude' option, below is a query to get the reviews for a product called "DeWalt Cordless Drill Driver Combo":

GET catalog.review/_search
{
  "size": 0,
  "query": {
    "terms": {
      "entityPkValue": [3994]
    }
  },
  "aggregations" : {
        "word_cloud" : {
            "sampler" : {
                "shard_size" : 500
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { 
                      "size": 10,
                      "filter_duplicate_text": true,
                      "field" : "detail.shingles",
                      "exclude" : ["dewalt", "cordless", "drill", "driver", "combo"]
                    }
                }
            }
        }
    }
}

[
          {
            "key": "drill combo",
            "doc_count": 6,
            "score": 3.7979363294048607,
            "bg_count": 6
          },
          {
            "key": "cordless drill",
            "doc_count": 5,
            "score": 3.164946941170718,
            "bg_count": 5
          },
          {
            "key": "combo set",
            "doc_count": 4,
            "score": 2.0199716367548537,
            "bg_count": 5
          },
          {
            "key": "drill  driver",
            "doc_count": 4,
            "score": 2.0199716367548537,
            "bg_count": 5
          },
          {
            "key": "torque",
            "doc_count": 4,
            "score": 2.0199716367548537,
            "bg_count": 5
          },
          {
            "key": "battery",
            "doc_count": 19,
            "score": 1.5173731862543052,
            "bg_count": 140
          },
          {
            "key": "battery life",
            "doc_count": 4,
            "score": 1.1097744524317952,
            "bg_count": 9
          },
          {
            "key": "charge",
            "doc_count": 6,
            "score": 0.8113518183448254,
            "bg_count": 27
          },
          {
            "key": "diy",
            "doc_count": 4,
            "score": 0.7596986123075419,
            "bg_count": 13
          }
        ]

A lot of the results are really good, but the first 4 contain tokens from the product's name and are not excluded, because each shingle token is a multi-word string and doesn't match the single words provided to exclude.

I could generate n-gram shingles from the product title in the client before performing the query. Or maybe ES provides a mechanism for applying a token filter to the aggregation query, so that ES could generate the exclude list from the title?

Another option I might look into is whether I could create an ES plugin that provides a custom token filter at index time. Hoping there's a way for the custom filter to look at another field, in order to generate a series of words to exclude.

The ‘_analyze’ API would allow you to feed the title text through the shingle analyzer to get the set of tokens for the exclude list. This would need to be a call from your client code before calling the significant text aggregation.
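A minimal Python sketch of that client-side step. The function names are hypothetical, and the sample response is hand-written and trimmed to the "token" field for illustration (a real _analyze response also carries offsets, type and position for each token).

```python
def analyze_request(title, analyzer="phrase_shingles_analyzer"):
    """Body for GET catalog.review/_analyze, running the title through the shingle analyzer."""
    return {"analyzer": analyzer, "text": title}


def exclude_list(analyze_response):
    """Collect the distinct tokens from an _analyze response, in first-seen order."""
    tokens = (t["token"] for t in analyze_response.get("tokens", []))
    return list(dict.fromkeys(tokens))


# Hand-written, trimmed response for the text "Cordless Drill" (illustrative only).
sample_response = {
    "tokens": [
        {"token": "cordless"},
        {"token": "cordless drill"},
        {"token": "drill"},
    ]
}

significant_text = {
    "field": "detail.shingles",
    "size": 10,
    "filter_duplicate_text": True,
    # Exact strings, so multi-word shingles match the indexed tokens as-is.
    "exclude": exclude_list(sample_response),
}
print(significant_text["exclude"])
```

Because the exclude strings come from the same analyzer that indexed detail.shingles, lowercasing, stop-word gaps and trimming should line up with the indexed tokens without being re-implemented in the client.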

That is pure genius! Thanks so much.

If your products are organised into categories, another thing you might want to play with is the ‘background_filter’.
Rather than diffing a specific drill against all products, you could diff the product’s reviews against just the other products in the power tools category. This might help focus in on product-specific things like “battery life” rather than category-specific words like “diy”.
It all depends on data volumes, though, for the signals to stand out.
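A minimal Python sketch of such a request. It assumes each review document carries a keyword field named "category", which is not part of the mapping shown earlier, and the helper name is made up for this example.

```python
import json


def product_vs_category_body(product_id, category, field="detail.shingles"):
    """Compare one product's reviews against reviews in the same category only."""
    return {
        "size": 0,
        # Foreground: reviews for the single product.
        "query": {"term": {"entityPkValue": {"value": product_id}}},
        "aggregations": {
            "word_cloud": {
                "sampler": {"shard_size": 500},
                "aggregations": {
                    "keywords": {
                        "significant_text": {
                            "field": field,
                            "size": 10,
                            "filter_duplicate_text": True,
                            # Background: only reviews in the category.
                            # "category" is an assumed field name.
                            "background_filter": {"term": {"category": category}},
                        }
                    }
                },
            }
        },
    }


body = product_vs_category_body(3994, "power tools")
print(json.dumps(body, indent=2))
```

The product's own reviews should also match the background filter, so that the foreground set is a subset of the background it is compared against.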
