I've got an index of product reviews in Elasticsearch 6.3 (I can upgrade if necessary).
I'd like to create a 'phrase cloud' of significant phrases found in the reviews, similar to how it's done on Amazon product reviews.
I've started by indexing the 'detail' field of each document with a shingle token filter, and I'm retrieving the necessary data with a significant text aggregation query.
Here are the relevant index settings and mapping, followed by the query:
```
"analysis": {
  "filter": {
    "phrase_shingles_token_filter": {
      "max_shingle_size": "4",
      "output_unigrams": "true",
      "type": "shingle",
      "filler_token": ""
    }
  },
  "analyzer": {
    "phrase_shingles_analyzer": {
      "filter": [
        "standard",
        "lowercase",
        "stop",
        "phrase_shingles_token_filter",
        "trim"
      ],
      "type": "custom",
      "tokenizer": "standard"
    }
  }
},
"mappings": {
  "review": {
    "properties": {
      "createdAt": {
        "type": "date"
      },
      "detail": {
        "type": "text",
        "similarity": "BM25",
        "fields": {
          "shingles": {
            "type": "text",
            "similarity": "BM25",
            "analyzer": "phrase_shingles_analyzer"
          }
        }
      },
```
```
GET catalog.review/_search
{
  "size": 0,
  "query": {
    "term": {
      "entityPkValue": {
        "value": 4186
      }
    }
  },
  "aggregations": {
    "word_cloud": {
      "sampler": {
        "shard_size": 500
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "size": 20,
            "filter_duplicate_text": true,
            "field": "detail.shingles",
            "exclude": ["greenwork", "lawn", "mow"]
          }
        }
      }
    }
  }
}
```
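To show what ends up in 'detail.shingles', the analyzer's output can be inspected with the _analyze API. This is only a sketch: the sample text is made up, and the request assumes the index is named catalog.review as in the query above.

```
GET catalog.review/_analyze
{
  "field": "detail.shingles",
  "text": "great storage box"
}
```

For this input I'd expect the single words plus every 2- and 3-word shingle ('great', 'great storage', 'great storage box', 'storage', 'storage box', 'box'), so each phrase is stored as its own whitespace-separated term.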
The aggregation kinda works, but there are a few issues with the relevance of the results:
1 - Tokens from the product's name keep appearing in the results. I'd like to be able to retrieve a list of tokens from the product's title and exclude related tokens from the results. Because each shingle token is a whitespace-separated sequence of words, the exact-term 'exclude' list on the aggregation only removes the single-word tokens, not the shingles that contain them. It would be good if I could exclude tokens related to the title at index time.
2 - Is there a way to boost longer phrases? I'm not sure it's possible.
3 - Lots of similar words appear in the results, e.g. I get 'storage', 'storage box' and 'storing' as significant phrases. I'm not sure if I can do much about this.
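On issue 1, one query-time workaround might be the regular-expression form of 'exclude', which is matched against the whole term and so can drop any shingle containing a title word. A sketch of that variation, untested on my data and reusing the three words from the query above:

```
GET catalog.review/_search
{
  "size": 0,
  "query": {
    "term": {
      "entityPkValue": {
        "value": 4186
      }
    }
  },
  "aggregations": {
    "word_cloud": {
      "sampler": {
        "shard_size": 500
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "size": 20,
            "filter_duplicate_text": true,
            "field": "detail.shingles",
            "exclude": ".*(greenwork|lawn|mow).*"
          }
        }
      }
    }
  }
}
```

That still means building the pattern from the product title on every request, which is why I'd prefer an index-time solution.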
I'm hoping I can retrieve the kind of results I want directly from Elasticsearch, without having to process the data elsewhere. I'm also not sure whether I'm on the right track.