Significant Term aggregation

Hi:
I have been trying to use (and successfully did use) the Significant Terms
aggregation in release 1.1.0. The blog post about this feature,
http://www.elasticsearch.org/blog/significant-terms-aggregation/, was
extremely helpful. Since this feature is at an experimental stage and the
authors had requested feedback, and since I did not know how else to
provide feedback on a specific feature, I am resorting to posting on this
group.

I had posted on a different thread about accessing the TF-IDF scores for
terms so that I could investigate ways to enhance my queries. That led me
to look at the experimental Significant Terms aggregation. It does what it
says quite well, and I am glad this functionality exists. However, I would
like to suggest some possible enhancements:

What I noticed in my aggregation results is a lot of stopwords (a, an,
the, at, and, etc.) being included as significant terms. Perhaps there
could be an option to supply a stopword list so that these words are not
included in the significant-term calculations. (The significance is
calculated based on how many times a term appears in the query result set
versus how many times it appears in the whole index.) For common stopwords
this calculation is going to make them look very significant.

Another possible enhancement would be phrase significance: instead of
scoring a single term, scoring multi-term phrases would be nice.

In the blog post, a similar effect is obtained by highlighting the terms
that are identified as significant, but it would be nice to be able to
determine that just by looking at the buckets.

Cheers and Thanks for all the fish

Ramdev

Thanks for the feedback, Ramdev.

What I noticed in my aggregation results is a lot of stopwords (a, an,
the, at, and, etc.) being included as significant terms.

These sorts of terms shouldn't really need any sort of special treatment.
If they are appearing as suggestions then I expect one of the following
statements to be true:

  1. You have a very small number of docs in the result set representing the
    "foreground" sample. Significant terms needs a reasonable number of docs
    in a sample to draw any real conclusions.
  2. You have query criteria that do not identify a result set with any sense
    of cohesion, e.g. a query for random docs.
  3. You have changed the set of stopwords in use in your index - what
    previously never appeared at all is now suddenly common, or vice versa.
  4. You are querying across mixed indices or doc types (one with stopwords,
    one without) and we fail to tune out the stopwords as part of the results
    merging process, because one small index reports them back as commonplace
    while another large index has them as missing or rare. In the merged
    stats they therefore appear to be highly correlated with your query
    request.

Please let me know if none of these scenarios explain your results.

Another possible enhancement would be phrase significance: instead of
scoring a single term, scoring multi-term phrases would be nice.

I outline some of the possibilities for creating phrases from significant
terms, starting 51 minutes into this recent video:
Revealing the Uncommonly Common with Elasticsearch | SkillsCast | 24th April 2014

Cheers and Thanks for all the fish

You're welcome and thanks again for the feedback
Mark

Hi Mark:
Thanks for the update.
The corpus I am searching against is a news-feed corpus, and the number of
documents is not really that small (some queries return over 400K docs in
the result set). And since these are news articles, the documents are not
short, Twitter-like sentences. Most of my query results have at least tens
of thousands of documents, if not more.

Your second concern, that the query criteria are not identifying a result
set with any sense of cohesion, might be true. Basically the search I am
executing is a filter: either the document metadata has the value or it
does not. Hence the result set may not be "cohesive". The reason for me to
use significant terms is so that the query can be enhanced to return a more
cohesive set of documents.

I am using the standard stopword list that comes with ES and have not
added to or removed from it.
I am also not querying across multiple indices/types (there is only one
index with one type within it).

I will watch the video and see if I can get some ideas to improve my
queries.

All in all I find the new aggregations feature quite helpful (at least for
generating some descriptive analytics).

Cheers

Ramdev

Your second concern, that the query criteria are not identifying a result
set with any sense of cohesion, might be true. Basically the search I am
executing is a filter: either the document metadata has the value or it
does not. Hence the result set may not be "cohesive". The reason for me to
use significant terms is so that the query can be enhanced to return a more
cohesive set of documents.

We can probably debug that from the results of the agg. For each
"significant" term you should get a score and all the ingredients that went
into it are also available:

  1. The number of docs in the result set with the given term
  2. The size of your result set
  3. The number of docs in the index with the given term (see the "bg_count"
    value)
  4. The size of the index

In a "cohesive" set you should see a reasonable difference in the term
probabilities, i.e. comparing 1/2 (the foreground probability) with 3/4
(the background probability). If all you've selected in your query is
effectively random docs with no common theme, then the use of words in the
background and foreground barely differs, 1/2 and 3/4 are practically the
same, and you get a poor-scoring set of results.
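
To make that concrete with some invented numbers (purely for illustration,
not taken from your index): if your result set holds 10,000 docs and 9,000
of them contain "fuel", the foreground frequency is 9,000 / 10,000 = 90%.
If the whole index holds 1,000,000 docs and "fuel" has a bg_count of
20,000, the background frequency is 20,000 / 1,000,000 = 2%. A jump from
2% to 90% is a strong signal and scores highly. A stopword like "the"
might only move from, say, 60% in the background to 62% in the foreground -
practically no change - so it should score poorly unless one of the
scenarios above is skewing the stats.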

I think I should clarify something. Even though my query is essentially a
filter, the "significant terms" aggregation is run against the body of the
documents (which is typical prose in a news document).

Here is an example: a query of the index to find docs with a specific
string in the field "Class_Text", with a Significant Terms aggregation on
the body of the document:
POST _search
{
  "size" : 0,
  "query" : {
    "nested" : {
      "query" : {
        "match" : {
          "Class_Text" : {
            "query" : "Fuel Cell & Battery",
            "type" : "boolean"
          }
        }
      },
      "path" : "SMART_TERM"
    }
  },
  "aggregations" : {
    "sigTerms" : {
      "significant_terms" : {
        "field" : "BODY.v",
        "size" : 1000
      }
    }
  }
}

......
{
  "key": "resistance",
  "doc_count": 68795,
  "score": 53.42999474620047,
  "bg_count": 129149
},
{
  "key": "patented",
  "doc_count": 42848,
  "score": 50.98806065128648,
  "bg_count": 52548
},
{
  "key": "marketintelligencecenter.com's",
  "doc_count": 33701,
  "score": 48.58994469232905,
  "bg_count": 34122
},
{
  "key": "for",
  "doc_count": 427040,
  "score": 47.73227955829178,
  "bg_count": 5483708
},
{
  "key": "html",
  "doc_count": 91658,
  "score": 46.79933234224686,
  "bg_count": 261374
},
{
  "key": "an",
  "doc_count": 348706,
  "score": 43.20270422802958,
  "bg_count": 4046974
},
{
  "key": "protection",
  "doc_count": 80987,
  "score": 43.187880126230326,
  "bg_count": 221159
},
{
  "key": "of",
  "doc_count": 430217,
  "score": 42.90990816758588,
  "bg_count": 6177535
},
{
  "key": "by",
  "doc_count": 364873,
  "score": 42.68719313911975,
  "bg_count": 4480098
},
......

As you can see, words like "for", "an", "of" and "by" are showing up in
the aggregation list with scores high enough to put them in the top 50
significant terms.

The documents get tagged with Class_Text after being classified, and that
tag value is what the query matches on.

In my case it would be more helpful if I were able to get phrases rather
than single terms. (I have yet to finish watching your presentation.)

Let me know if you have any insight.

Thanks much

Ramdev

So there are potentially several things going on here:

  1. Your query may be too broad - depending on how your analysis is set up,
    you are likely querying for [fuel] OR [cell] OR [battery] as independent
    words, meaning you'll match a lot of docs, e.g. those mentioning only
    "fuel prices". This reduces the "cohesion" of the topics covered in the
    result set. Consider using ANDs or phrases on free-text queries, or use
    untokenized category fields to tighten up the result set (see the sketch
    at the end of this message).
  2. Some of your docs look to cover many diverse topics in one doc, e.g.
    this one mentions fuel, Facebook and a drugstore: Market Intelligence
    Center - Finance & Crypto News. Can these multi-story pages be filtered
    out somehow?
  3. Do your bodies have standard "boilerplate" text common to many pages,
    e.g. the author's biography as shown here: Market Intelligence Center -
    Finance & Crypto News? If so, the repetition of a common passage may make
    certain words undesirably highly correlated with a topic, because the
    author who covers that industry sector likely has his biography on every
    related page, and words from his biography (e.g. a university name) will
    be skewed towards that industry sector.

So reasonably clean, on-topic data is required to derive anything sensible
using this statistical approach.
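
Purely as a sketch, reusing the field names from your example (adjust to
your actual mapping), requiring all of the words to be present already
tightens the foreground set considerably:

POST _search
{
  "size" : 0,
  "query" : {
    "nested" : {
      "path" : "SMART_TERM",
      "query" : {
        "match" : {
          "Class_Text" : {
            "query" : "Fuel Cell & Battery",
            "operator" : "and"
          }
        }
      }
    }
  },
  "aggregations" : {
    "sigTerms" : {
      "significant_terms" : {
        "field" : "BODY.v",
        "size" : 1000
      }
    }
  }
}

A phrase match ("type" : "phrase" on the match query) is stricter still if
"Fuel Cell & Battery" is really a single category label, and filtering on a
not_analyzed copy of the category field is the tightest option of all.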

Pages like this suggest where the terms "patented", "resistance" and
"marketintelligencecenter.com's" are being picked up:
http://www.marketintelligencecenter.com/artificialintelligence.aspx?p=4
Much of it looks machine-generated.

Too much repetition of stock phrases mixed in with diverse topics makes it
hard to pick up any kind of signal if this is the content you are including
in your searches.

Cheers,
Mark

Hi Mark:
That was just one example. The documents are news articles, hence the
broad coverage rather than specifically on-topic documents. Since this is
news from third-party sources, I do not have control over what comes into
the index (i.e. I cannot separate the machine-generated content from the
manually edited/curated content). That said, I could perhaps whittle the
content down by making sure that the documents processed are indeed worthy
news articles and not random blog posts or non-relevant docs.

I do agree with your earlier comment that the query may be too broad. As I
have already mentioned, these are news articles provided by various
sources; if they come with boilerplate text, I cannot do much else other
than process the documents to remove it. (For now we are not looking into
removing the boilerplate text, as it might provide us with some insight
into other information.)

The initial investigative exercise in using significant terms was to
identify terms that could perhaps enhance the content returned. There is,
of course, some manual editing of the significant terms to remove
nonsensical ones (in context) to get to the final list of terms to be added
to my query.
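
To illustrate the workflow, this is roughly the shape of the query I end up
with (the terms below are made up for this example, not my real list):

POST _search
{
  "query" : {
    "bool" : {
      "must" : {
        "nested" : {
          "path" : "SMART_TERM",
          "query" : {
            "match" : { "Class_Text" : "Fuel Cell & Battery" }
          }
        }
      },
      "should" : [
        { "terms" : { "BODY.v" : [ "electrolyte", "cathode", "membrane" ] } }
      ]
    }
  }
}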

Is there other functionality (experimental or otherwise) within ES that
can help me do this?

Hi Ramdev,

Is there other functionality (experimental or otherwise) within ES that
can help me do this?

I'd recommend splitting HTML files that are clearly referencing multiple
diverse news stories into multiple ES documents based on title headings or
whatever indicates the start/end of each news item.

For boilerplate-removal I have previously used this analyzer on an earlier
incarnation of the significant_terms algo:
[LUCENE-725] NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text - ASF JIRA

Cheers
Mark
