Significant term aggregation with Snowball analyzer

Hi,

I am using the Elasticsearch snowball analyzer on a product field in an index. I need to get significant terms from an Elasticsearch aggregation (significant terms aggregation), but the results are not correct. The problem is that the terms I get back do not appear exactly as they do in the hits.
For example, the productDescription values in the hits look like this:
"SWEET BISCUITS - MILK BIKIS MILK CREAM"
"BRITTANNIA PRODUCTS: MILK BIKIES CREAM 1 00GM X 100NOS"

and the significant term I am getting is:

{
  "key": "biki",
  "doc_count": 4,
  "score": 553.7252991452992,
  "bg_count": 260
}

Please suggest how I can get correct results like "BIKIES" and "BIKIS".

Below is a sample query:
{
  "_source": {
    "include": [
      "productDescription"
    ]
  },
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "default_field": "productDescription.SnowField",
            "default_operator": "AND",
            "query": "(milk cream)"
          }
        },
        {
          "term": {
            "isUnique": true
          }
        },
        {
          "range": {
            "date": {
              "gte": "2018-01-01",
              "lte": "2018-12-31"
            }
          }
        }
      ],
      "must": [],
      "must_not": []
    }
  },
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "size": 500,
  "aggs": {
    "my_sample": {
      "sampler": {
        "shard_size": 20
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "field": "productDescription.SnowField",
            "size": 10,
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}

Use a field with an analyzer that doesn't stem or lowercase, e.g. the "whitespace" analyzer.
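
A minimal sketch of what that could look like, assuming the existing productDescription field is of type text: add a sub-field analyzed with the built-in whitespace analyzer and point the aggregation at it. The sub-field name WhitespaceField is made up for illustration, and the fragment leaves out the existing sub-fields and any mapping type wrapper your Elasticsearch version may need.

```python
import json

# Hypothetical sub-field name; pick whatever fits your naming scheme.
WHITESPACE_SUBFIELD = "WhitespaceField"

# Mapping fragment: a multi-field on productDescription that is tokenized on
# whitespace only, so tokens are neither stemmed nor lowercased.
mapping_fragment = {
    "properties": {
        "productDescription": {
            "type": "text",
            "fields": {
                WHITESPACE_SUBFIELD: {"type": "text", "analyzer": "whitespace"}
            },
        }
    }
}

# Field name to use in the significant_text aggregation.
agg_field = "productDescription." + WHITESPACE_SUBFIELD

print(json.dumps(mapping_fragment, indent=2))
print(agg_field)
```

Documents indexed before the sub-field was added have no values in it until they are reindexed. Note also that the whitespace analyzer keeps punctuation attached to tokens, so "PRODUCTS:" stays one term.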

Hi Mark,

Thanks for the reply. The problem is not the lowercased results. The problem is that I got "biki" from the significant terms aggregation, while I need the terms as they are, like "bikies" and "bikis", as you can see in the hits returned. This was a sample result. When I checked with different searches, I got many words that were misspelled (with the s/es removed, i.e. without their plural endings). What I need is meaningful words (suggestions).

That's what "stemming" does.

Check out this blog which includes an example of taking potentially stemmed significant terms and using them in a terms query with a highlighter to show KWIC (Keywords In Context) examples of the discovered terms in text.
Note it talks about significant_terms rather than the new significant_text aggregation but the same principles still hold.
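
As a rough sketch of that idea (not the blog's code): pass the stemmed keys to a terms query on the stemmed field, ask for highlighting, and collect whatever the highlighter wrapped in its default <em> tags. The response below is a hand-written, trimmed-down stand-in for a real search response, built from the two descriptions quoted earlier in this thread.

```python
import re

# Stemmed keys as returned by the aggregation (from the example above).
stemmed_terms = ["biki"]

# Request body: a terms query does not analyze its input, so the stemmed keys
# are matched as-is against the stemmed field, and the hits are highlighted.
request_body = {
    "query": {"terms": {"productDescription.SnowField": stemmed_terms}},
    "highlight": {"fields": {"productDescription.SnowField": {}}},
}

# The highlighter wraps matches in <em>...</em> unless configured otherwise.
EM_TAG = re.compile(r"<em>(.*?)</em>")


def surface_forms(response):
    """Collect the original words that were highlighted in a search response."""
    forms = set()
    for hit in response["hits"]["hits"]:
        for fragments in hit.get("highlight", {}).values():
            for fragment in fragments:
                forms.update(EM_TAG.findall(fragment))
    return sorted(forms)


# Hypothetical response, reduced to the parts this function reads.
sample_response = {
    "hits": {
        "hits": [
            {"highlight": {"productDescription.SnowField": [
                "SWEET BISCUITS - MILK <em>BIKIS</em> MILK CREAM"]}},
            {"highlight": {"productDescription.SnowField": [
                "BRITTANNIA PRODUCTS: MILK <em>BIKIES</em> CREAM 1 00GM X 100NOS"]}},
        ]
    }
}

print(surface_forms(sample_response))  # ['BIKIES', 'BIKIS']
```

This keeps the stemmed field for discovery and only uses the highlighter to show the words as they were written.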

Hi Mark,

Yes, I knew that. It is due to the snowball analyzer, as I mentioned above. I am using snowball in the query because I need to include sound-alike words in the results. I also tried removing the Snowball analyzer from the aggregation, and tried the keyword analyzer on the aggregation field as well, but did not get exact results.

But is there any way to get exact results? For example, do I need to reindex the data with another analyzer to get significant results, or is there some other way to get aggregation results as they exist in the productDescription field?

It depends.
If your docs were orders where you wanted to know "which products are typically also bought with pasta?" then you might use a keyword field and significant_terms because you'd be examining significant patterns in repeated orders for exactly the same product.
If your docs were products you'd (hopefully) only ever have exactly one unique product description, so the keyword field would be of no use with any significance analysis (everything occurs once). If you were looking at some of the ingredients in the text of these descriptions (e.g. common ingredients mentioned in high-fat products) then you might use an analyzed text field and significant_text. Maybe indexing with shingles would help too. Remember the indexed field you search on can be different (e.g. stemmed) from the indexed field you use for significant_text analysis (e.g. whitespace).
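
To illustrate that last point, here is a small sketch that builds a request searching on the stemmed field while running significant_text on a different one. The aggregation field name productDescription.WhitespaceField is an assumption (a whitespace-analyzed sub-field that would have to exist in your mapping), and the other filters from the original query are left out for brevity.

```python
def build_request(search_field, agg_field, text):
    """Search on one field, run significant_text on another."""
    return {
        "query": {
            "query_string": {
                "default_field": search_field,
                "default_operator": "AND",
                "query": text,
            }
        },
        # Hits are not needed here; only the aggregation is of interest.
        "size": 0,
        "aggs": {
            "my_sample": {
                "sampler": {"shard_size": 20},
                "aggs": {
                    "keywords": {
                        "significant_text": {
                            "field": agg_field,
                            "size": 10,
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


request = build_request(
    search_field="productDescription.SnowField",     # stemmed, from the thread
    agg_field="productDescription.WhitespaceField",  # hypothetical unstemmed sub-field
    text="(milk cream)",
)

keywords = request["aggs"]["my_sample"]["aggs"]["keywords"]
print(request["query"]["query_string"]["default_field"])
print(keywords["significant_text"]["field"])
```

Matching still benefits from stemming, while the terms that come back are taken from the unstemmed tokens.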

Hi Mark,

If I am using significant terms on multiple indices, how can we specify the missing terms?

Significant terms is a tool for discovering terms - I don't follow why you're asking a question about specifying them?

Hi Mark,

Let me explain a few things. I have 2 indices, and I created the same alias on both so that I can search them at once. In one index the field is named productDescription, and in the second it is productDesc. The issue with getting significant terms is that when I pass the productDescription field name in the aggregation, it says: "Aggregation [keywords] cannot process field [productDescription.StandardField] since it is not present". So is there any way to pass two fields to the significant terms aggregation, or otherwise to ignore the missing field (like the "missing" property we pass in the terms aggregation, which is not supported in the significant terms aggregation)?

Ah. So missing "fields".
If the overall goal is to blend the term stats from 2 fields in 2 indices the answer is "no".
Generally, significant terms will work best on a single index and single shard since all of the stats are available in one place. If you're trying to use it to spot low-frequency terms (e.g. something that only occurs twice) in a distributed system that makes life hard because every single-occurrence term on a local shard (of which there are typically many) suddenly becomes a candidate for global consideration.
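
Since the stats cannot be blended, one possible workaround (an assumption, not something confirmed in this thread) is to skip the alias for this aggregation and send one request per index, each naming the field that exists there, keeping the two result lists separate. The index names and the productDesc.StandardField sub-field below are hypothetical.

```python
# Hypothetical index names mapped to the field that exists in each index.
FIELD_BY_INDEX = {
    "products_a": "productDescription.StandardField",
    "products_b": "productDesc.StandardField",
}


def significant_text_body(field, text):
    """Request body that searches and aggregates on the same, index-specific field."""
    return {
        "size": 0,
        "query": {"match": {field: {"query": text, "operator": "and"}}},
        "aggs": {"keywords": {"significant_text": {"field": field, "size": 10}}},
    }


# One body per index; each would be sent to its own index, not to the alias.
requests_by_index = {
    index: significant_text_body(field, "milk cream")
    for index, field in FIELD_BY_INDEX.items()
}

for index, body in requests_by_index.items():
    print(index, body["aggs"]["keywords"]["significant_text"]["field"])
```

Each index then scores terms against its own background counts, so the scores from the two responses are not directly comparable.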
