Significant Term aggregation

Ramdev_Wudali · April 30, 2014, 6:54pm

Hi:
I have been trying out the Significant Terms aggregation in release 1.1.0,
and have it working. The blog post about this feature,
http://www.elasticsearch.org/blog/significant-terms-aggregation/, was
extremely helpful. Since the feature is still experimental and the authors
asked for feedback, and since I do not know of another way to give feedback
on specific features, I am resorting to posting on this group.

I had posted on a different thread about accessing the TF-IDF scores for
terms so that I could investigate ways to enhance my queries. This led me to
the experimental Significant Terms aggregation. It does what it says quite
well, and I am glad this functionality exists. However, I would like to
suggest some possible enhancements:

What I noticed in my aggregation results is a lot of stopwords (a, an,
the, at, and, etc.) being included as significant terms. Perhaps it could be
made possible to supply stopword lists so that these words are excluded from
the significant term calculations. (The significance is calculated based on
how many times a term appears in the query result vs. how many times it
appears in the whole index.) For common stopwords this calculation is going
to make them very significant.

Another possible enhancement would be phrase significance, that is,
multi-term rather than single-term significance.

In the blog post, a similar effect is obtained by highlighting the terms
that are identified as significant. But it would be nice to be able to just
look at the buckets and determine that.

Cheers and Thanks for all the fish

Ramdev

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/95bec4ed-69c6-409d-b6b8-4bbe4c8da229%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark_Harwood_2 · May 1, 2014, 3:04pm

Thanks for the feedback, Ramdev.

What I noticed in my aggregation results is a lot of Stopwords (a, an,
the, at, and, etc.) being included as significant terms.

These sorts of terms shouldn't really need any sort of special treatment.
If they are appearing as suggestions then I expect one of the following
statements to be true:

You have a very small number of docs in the result set representing the
"foreground" sample. Significant terms needs a reasonable number of docs in
a sample to draw any real conclusions.

You have query criteria that is not identifying a result set with any
sense of cohesion, e.g. a query for random docs.

You have changed the set of stopwords in use in your index - what
previously never used to appear at all is now suddenly common or
vice-versa.

You are querying across mixed indices or doc-types (one with stop-words,
one without) and we fail to tune-out the stopwords as part of the results
merging process because one small index reports them back as commonplace
while another large index has them as missing or rare. In the merged stats
they therefore appear to be highly correlated with your query request.

Please let me know if none of these scenarios explain your results.

Another possible enhancement would be phrase significance, that is,
multi-term rather than single-term significance.

I outline some of the possibilities for creating phrases from significant
terms, starting 51 mins into this recent video:
Revealing the Uncommonly Common with Elasticsearch | SkillsCast | 24th April 2014

Cheers and Thanks for all the fish

You're welcome and thanks again for the feedback
Mark

Ramdev_Wudali · May 2, 2014, 1:31pm

Hi Mark:
Thanks for the update.
The corpus I am searching against is a news feed corpus, and the number of
documents is not really that small (some queries return over 400K docs in
the result set). These being news articles, the documents are not short
Twitter-like sentences. Most of my query results have at least tens of
thousands of documents, if not more.

Your second concern, that the query criteria are not identifying a result
set with any sense of cohesion, might be true. Basically the search I am
executing is a filter: the document metadata either has the value or it does
not. Hence the result set may not be "cohesive". My reason for using
Significant Terms is to enhance the query so that it returns a more cohesive
set of documents.

I am using the standard stopword list that comes with ES and have not
added to or removed from it.
I am also not querying across multiple indices/types (there is only
one index, with one type within the index).

I will watch the video and see if I can get some ideas to improve my
queries.

All in all, I find the new aggregations feature quite helpful (at least to
generate some descriptive analytics).

Cheers

Ramdev

Mark_Harwood_2 · May 2, 2014, 2:07pm

Your second concern, that the query criteria are not identifying a result
set with any sense of cohesion, might be true. Basically the search I am
executing is a filter: the document metadata either has the value or it does
not. Hence the result set may not be "cohesive". My reason for using
Significant Terms is to enhance the query so that it returns a more cohesive
set of documents.

We can probably debug that from the results of the agg. For each
"significant" term you should get a score and all the ingredients that went
into it are also available:

1. The number of docs in the result set with the given term
2. The size of your result set
3. The number of docs in the index with the given term (see the "bg_count"
   value)
4. The size of the index

In a "cohesive" set you should see a reasonable difference in the term
probabilities, i.e. value 1 divided by value 2 in the list above (the
foreground probability) vs. value 3 divided by value 4 (the background
probability).
If all you've selected in your query is effectively random docs with no
common theme, then the use of words in background and foreground barely
differs, and 1/2 vs 3/4 are practically the same, giving a poor-scoring set
of results.
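
The comparison described above can be sketched in a few lines of Python. This is only an illustration of the two ratios, not the scoring code Elasticsearch uses; the function name, the "lift" label and all the numbers are hypothetical.

```python
def term_probabilities(doc_count, fg_size, bg_count, bg_size):
    """Return (foreground probability, background probability) for one term."""
    fg_prob = doc_count / fg_size  # values 1 / 2: share of result-set docs with the term
    bg_prob = bg_count / bg_size   # values 3 / 4: share of index docs with the term
    return fg_prob, bg_prob


# Hypothetical set sizes, chosen only for illustration.
FG_SIZE = 10_000      # docs matched by the query
BG_SIZE = 1_000_000   # docs in the whole index

# Hypothetical (doc_count, bg_count) pairs.
examples = {
    "topical term": (2_000, 5_000),
    "common word": (9_000, 900_000),
}

for term, (doc_count, bg_count) in examples.items():
    fg, bg = term_probabilities(doc_count, FG_SIZE, bg_count, BG_SIZE)
    # "lift" here is just foreground / background, an informal label.
    print(f"{term}: foreground={fg:.3f} background={bg:.3f} lift={fg / bg:.1f}")
```

With these made-up inputs the topical term is far more frequent in the foreground than in the background, while the common word has the same probability in both, which is the "practically the same" case described above.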

Ramdev_Wudali · May 2, 2014, 6:03pm

I think I should clarify something. Even though my query is essentially a
filter, the "significant terms" aggregation is run against the body of the
documents (which is typical prose in a news document).

Here is an example.

Query: find docs with a specific string in the field "Class_Text", with a
Significant Terms aggregation on the body of the document:
POST _search
{
  "size": 0,
  "query": {
    "nested": {
      "query": {
        "match": {
          "Class_Text": {
            "query": "Fuel Cell & Battery",
            "type": "boolean"
          }
        }
      },
      "path": "SMART_TERM"
    }
  },
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "BODY.v",
        "size": 1000
      }
    }
  }
}

......
{
  "key": "resistance",
  "doc_count": 68795,
  "score": 53.42999474620047,
  "bg_count": 129149
},
{
  "key": "patented",
  "doc_count": 42848,
  "score": 50.98806065128648,
  "bg_count": 52548
},
{
  "key": "marketintelligencecenter.com's",
  "doc_count": 33701,
  "score": 48.58994469232905,
  "bg_count": 34122
},
{
  "key": "for",
  "doc_count": 427040,
  "score": 47.73227955829178,
  "bg_count": 5483708
},
{
  "key": "html",
  "doc_count": 91658,
  "score": 46.79933234224686,
  "bg_count": 261374
},
{
  "key": "an",
  "doc_count": 348706,
  "score": 43.20270422802958,
  "bg_count": 4046974
},
{
  "key": "protection",
  "doc_count": 80987,
  "score": 43.187880126230326,
  "bg_count": 221159
},
{
  "key": "of",
  "doc_count": 430217,
  "score": 42.90990816758588,
  "bg_count": 6177535
},
{
  "key": "by",
  "doc_count": 364873,
  "score": 42.68719313911975,
  "bg_count": 4480098
},
.......

As you can see, words like "for", "an", "of" and "by" are showing up in the
aggregation results with scores high enough to put them in the top 50
significant terms.
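
The excerpt does not show the result-set size or the index size, so the full foreground and background probabilities cannot be recomputed from it. As a rough diagnostic only (this is not how Elasticsearch computes the score, and the helper name is made up), one can still ask what share of each term's index-wide documents falls inside the result set:

```python
# (key, doc_count, bg_count) copied from the response excerpt above.
buckets = [
    ("resistance", 68795, 129149),
    ("patented", 42848, 52548),
    ("marketintelligencecenter.com's", 33701, 34122),
    ("for", 427040, 5483708),
    ("html", 91658, 261374),
    ("an", 348706, 4046974),
    ("protection", 80987, 221159),
    ("of", 430217, 6177535),
    ("by", 364873, 4480098),
]


def share_in_result_set(doc_count, bg_count):
    """Fraction of the term's docs in the whole index that matched the query."""
    return doc_count / bg_count


# Sort terms from most to least concentrated in the result set.
ranked = sorted(buckets, key=lambda b: share_in_result_set(b[1], b[2]), reverse=True)
for key, doc_count, bg_count in ranked:
    print(f"{key:32s} {share_in_result_set(doc_count, bg_count):.1%}")
```

A term spread evenly over the index would have a share close to (result-set size / index size). On these numbers "for", "an", "of" and "by" sort to the bottom with under 10% each, while "marketintelligencecenter.com's", "patented" and "resistance" sort to the top.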

The documents get tagged with Class_Text after being classified, and that
is the value being matched in the query.

In my case it would be more helpful to get phrases rather than single
terms. (I have yet to finish watching your presentation.)

Let me know if you have any insight.

Thanks much

Ramdev

Mark_Harwood_2 · May 2, 2014, 10:37pm

So there are potentially several things going on here:

Your query may be too broad. Depending on how your analysis is set up,
you are likely querying for [fuel] OR [cell] OR [battery] as independent
words, meaning you'll match a lot of docs, e.g. those mentioning only "fuel
prices". This reduces the "cohesion" of the topics covered in the
result set. Consider using ANDs or phrases on free-text queries, or use
untokenized category fields, to tighten up the result set.

Some of your docs look to cover many diverse topics in one doc, e.g. this
one mentions fuel, facebook and a drugstore:
Market Intelligence Center - Finance & Crypto News
Can these multi-story pages be filtered out somehow?

Do your bodies have standard "boilerplate" text common to many pages,
e.g. the author's biography as shown here:
Market Intelligence Center - Finance & Crypto News
If so, then the repetition of a common passage may make certain words
undesirably highly correlated with a topic, because the author who covers
that industry sector likely has his biography in every related page, and
words from his biography (e.g. a university) will be skewed towards that
industry sector.

So reasonably clean, on-topic data is required to derive anything sensible
using this statistical approach.
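
As a sketch of the "ANDs or phrases" suggestion, the Python below builds two variants of the request body posted earlier in the thread: one using the match query's "operator": "and", one using match_phrase. The helper name is hypothetical, the field names are taken from the posted query, and whether either variant suits this index depends on the mapping and analysis, which the thread does not show.

```python
import json


def tightened_bodies(text, field="Class_Text", path="SMART_TERM", agg_field="BODY.v"):
    """Build two request bodies: one requiring all words, one requiring the phrase."""
    def wrap(inner_query):
        # Same overall shape as the query posted earlier in the thread.
        return {
            "size": 0,
            "query": {"nested": {"path": path, "query": inner_query}},
            "aggregations": {
                "sigTerms": {"significant_terms": {"field": agg_field}}
            },
        }

    # Every analyzed word must match, instead of any one of them.
    all_words = wrap({"match": {field: {"query": text, "operator": "and"}}})
    # The words must appear together, in order, as a phrase.
    phrase = wrap({"match_phrase": {field: text}})
    return all_words, phrase


all_words, phrase = tightened_bodies("Fuel Cell & Battery")
print(json.dumps(all_words, indent=2))
print(json.dumps(phrase, indent=2))
```

Either printed body can be sent as the payload of POST _search in place of the original one.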

Mark_Harwood_2 · May 2, 2014, 11:17pm

Pages like this one suggest where the terms "patented", "resistance" and
"marketintelligencecenter.com's" are being picked up:
http://www.marketintelligencecenter.com/artificialintelligence.aspx?p=4
Much of it looks machine-generated.

Too much repetition of stock phrases mixed in with diverse topics makes it
hard to pick up any kind of signal, if this is the content you are including
in your searches.

Cheers,
Mark

Ramdev_Wudali · May 5, 2014, 1:36pm

Hi Mark:
That was just one example. The documents are news articles, hence the
broad coverage rather than specific, on-topic documents. Since this is news
from third-party sources, I do not have control over what comes into the
index (i.e. I cannot separate the machine-generated from the manually
edited/curated). That said, I could perhaps whittle the content down by
making sure that the documents processed are indeed worthy news articles and
not random blog posts and non-relevant docs.

I do agree with your earlier comment that the query may be too broad. As I
have already mentioned, these are news articles. If these news articles
(which are provided by various sources) come with boilerplate text, there is
not much I can do other than process the documents to remove it. (For now we
are not looking into removing the boilerplate text, as it might provide us
with some insight into other information.)

The initial investigative exercise in using Significant Terms was to
identify terms that could perhaps enhance the content returned. There is,
of course, some manual editing of the significant terms to remove
nonsensical terms (in context, of course) to get to the final list of terms
to be added to my query.

Is there other functionality (experimental or otherwise) within ES that can
help me do this?

Mark_Harwood_2 · May 6, 2014, 10:18am

Hi Ramdev,

Is there other functionality (experimental or otherwise) within ES that
can help me do this?

I'd recommend splitting HTML files that are clearly referencing multiple
diverse news stories into multiple ES documents based on title headings or
whatever indicates the start/end of each news item.
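
A minimal Python sketch of that splitting step, assuming (hypothetically) that each story on a page starts at an <h2> heading. Real feeds will use different markers, and a proper HTML parser would be more robust than these regular expressions.

```python
import re

# Assumption: every news item begins with an <h2> heading.
HEADING = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def split_stories(html):
    """Split one multi-story page into a list of {"title", "body"} dicts."""
    parts = HEADING.split(html)
    # parts = [text before first heading, title1, body1, title2, body2, ...]
    stories = []
    for title, body in zip(parts[1::2], parts[2::2]):
        stories.append({
            "title": TAG.sub("", title).strip(),
            "body": TAG.sub(" ", body).strip(),
        })
    return stories


page = (
    "<h2>Fuel cell maker expands</h2><p>Story one text.</p>"
    "<h2>Drugstore chain reports earnings</h2><p>Story two text.</p>"
)
for doc in split_stories(page):
    print(doc)
```

Each returned dict could then be indexed as its own ES document, so that a query on one topic no longer pulls in the words of unrelated stories from the same page.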

For boilerplate-removal I have previously used this analyzer on an earlier
incarnation of the significant_terms algo:
[LUCENE-725] NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text - ASF JIRA
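
The analyzer linked above is Java/Lucene code, and its internals are not described in this thread. As a much simpler stand-in for the same idea (not that analyzer's algorithm), the Python sketch below drops any sentence that appears in more than a chosen fraction of documents before indexing. The function name, the threshold and the sample texts are all made up for illustration.

```python
import re
from collections import Counter


def strip_repeated_sentences(docs, max_doc_fraction=0.5):
    """Drop sentences that occur in more than max_doc_fraction of the docs."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    split = [
        [s.strip() for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
        for d in docs
    ]
    doc_freq = Counter()
    for sentences in split:
        doc_freq.update(set(sentences))  # count each sentence once per doc
    limit = max_doc_fraction * len(docs)
    return [
        " ".join(s for s in sentences if doc_freq[s] <= limit)
        for sentences in split
    ]


docs = [
    "Fuel cell sales rose. Jane Doe studied at Example University.",
    "Battery prices fell. Jane Doe studied at Example University.",
    "A drugstore chain expanded.",
]
print(strip_repeated_sentences(docs))
```

Here the repeated biography sentence occurs in two of three docs and is removed, so words from it can no longer look correlated with the topic those docs share.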

Cheers
Mark
