I think I should clarify something. Even though my query is essentially a
filter, the "significant terms" aggregation is run against the body of the
documents (which is typical prose in a news document).
Here is an example.
Query: find docs with a specific string in the field "Class_Text", with a
significant terms aggregation on the body of the document:
POST _search
{
  "size" : 0,
  "query" : {
    "nested" : {
      "query" : {
        "match" : {
          "Class_Text" : {
            "query" : "Fuel Cell & Battery",
            "type" : "boolean"
          }
        }
      },
      "path" : "SMART_TERM"
    }
  },
  "aggregations" : {
    "sigTerms" : {
      "significant_terms" : {
        "field" : "BODY.v",
        "size" : 1000
      }
    }
  }
}
......
{
  "key": "resistance",
  "doc_count": 68795,
  "score": 53.42999474620047,
  "bg_count": 129149
},
{
  "key": "patented",
  "doc_count": 42848,
  "score": 50.98806065128648,
  "bg_count": 52548
},
{
  "key": "marketintelligencecenter.com's",
  "doc_count": 33701,
  "score": 48.58994469232905,
  "bg_count": 34122
},
{
  "key": "for",
  "doc_count": 427040,
  "score": 47.73227955829178,
  "bg_count": 5483708
},
{
  "key": "html",
  "doc_count": 91658,
  "score": 46.79933234224686,
  "bg_count": 261374
},
{
  "key": "an",
  "doc_count": 348706,
  "score": 43.20270422802958,
  "bg_count": 4046974
},
{
  "key": "protection",
  "doc_count": 80987,
  "score": 43.187880126230326,
  "bg_count": 221159
},
{
  "key": "of",
  "doc_count": 430217,
  "score": 42.90990816758588,
  "bg_count": 6177535
},
{
  "key": "by",
  "doc_count": 364873,
  "score": 42.68719313911975,
  "bg_count": 4480098
},
.......
As you can see, words like "for", "an", "of" and "by" are showing up in the
aggregation results with scores high enough to put them in the top 50
significant terms.
The documents get tagged with Class_Text after being classified, and that
is the value the query matches on.
In my case it would be more helpful if I could get phrases rather than
single terms. (I have yet to finish watching your presentation.)
Let me know if you have any insight.
Thanks much,
Ramdev
On Fri, May 2, 2014 at 9:07 AM, Mark Harwood <mark.harwood@elasticsearch.com> wrote:
Your second concern, that the query criteria are not identifying a result
set with any sense of cohesion, might be true. Basically, the search I am
executing is a filter: the document metadata either has the value or it
does not. Hence the result set may not be "cohesive". My reason for using
significant terms is so that the query can be enhanced to provide a more
cohesive set of documents.
We can probably debug that from the results of the agg. For each
"significant" term you should get a score, and all the ingredients that went
into it are also available:
1. The number of docs in the result set with the given term
2. The size of your result set
3. The number of docs in the index with the given term (see the "bg_count"
value)
4. The size of the index
In a "cohesive" set you should see a reasonable difference in the term
probabilities, i.e. number 1 divided by number 2 (foreground) vs number 3
divided by number 4 (background).
If all you've selected in your query is effectively random docs with no
common theme, then the use of words in background and foreground barely
differs, and 1/2 vs 3/4 are practically the same, giving a poor-scoring set
of results.
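To make that comparison concrete, here is a small Python sketch of the two probabilities built from the four ingredients listed above. The doc_count and bg_count values are copied from two of the buckets earlier in this thread; the result-set size and index size are not stated anywhere in the thread, so FG_SIZE and BG_SIZE are made-up placeholders. The "lift" printed here is a plain ratio for illustration only, not the score Elasticsearch computes.

```python
# Hypothetical sizes: the thread never states them, so these are placeholders.
FG_SIZE = 500_000       # assumed size of the result set (foreground)
BG_SIZE = 10_000_000    # assumed size of the index (background)

# (doc_count, bg_count) pairs copied from the aggregation output in this thread.
BUCKETS = {
    "patented": (42848, 52548),
    "for": (427040, 5483708),
}


def term_probabilities(doc_count, bg_count, fg_size=FG_SIZE, bg_size=BG_SIZE):
    """Return (foreground probability, background probability) for one term."""
    return doc_count / fg_size, bg_count / bg_size


def lift(doc_count, bg_count, fg_size=FG_SIZE, bg_size=BG_SIZE):
    """Foreground probability divided by background probability."""
    fg_prob, bg_prob = term_probabilities(doc_count, bg_count, fg_size, bg_size)
    return fg_prob / bg_prob


for term, (doc_count, bg_count) in BUCKETS.items():
    fg_prob, bg_prob = term_probabilities(doc_count, bg_count)
    print(f"{term}: fg={fg_prob:.4f} bg={bg_prob:.4f} "
          f"lift={lift(doc_count, bg_count):.2f}")
```

A lift close to 1 means the term is about as common in the result set as in the index as a whole. Plugging in the real result-set and index sizes in place of the placeholders should show whether the stopwords really stand out or only appear to.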
On Thursday, 1 May 2014 10:04:15 UTC-5, Mark Harwood wrote:
Thanks for the feedback, Ramdev.
What I noticed in my aggregation results is a lot of Stopwords (a, an,
the, at, and, etc.) being included as significant terms.
These sorts of terms shouldn't really need any sort of special
treatment. If they are appearing as suggestions then I expect one of the
following statements to be true:
- You have a very small number of docs in the result set representing
the "foreground" sample. Significant terms needs a reasonable number of
docs in a sample to draw any real conclusions
- You have query criteria that are not identifying a result set with any
sense of cohesion, e.g. a query for random docs
- You have changed the set of stopwords in use in your index - what
previously never used to appear at all is now suddenly common or
vice-versa.
- You are querying across mixed indices or doc-types (one with
stop-words, one without) and we fail to tune-out the stopwords as part of
the results merging process because one small index reports them back as
commonplace while another large index has them as missing or rare. In the
merged stats they therefore appear to be highly correlated with your query
request.
Please let me know if none of these scenarios explain your results.
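The mixed-indices scenario is the least intuitive of these, so here is a rough Python sketch of it. It assumes, as a simplification, that merging results just sums the per-index counts; all the numbers are invented for illustration, and IndexStats, merge and probabilities are hypothetical names, not part of any Elasticsearch API.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    fg_with_term: int  # matching docs that contain the term
    fg_size: int       # all matching docs in this index
    bg_with_term: int  # all docs in this index that contain the term
    bg_size: int       # all docs in this index


def merge(stats):
    """Simplified merge: sum each count across indices (an assumption)."""
    return IndexStats(
        fg_with_term=sum(s.fg_with_term for s in stats),
        fg_size=sum(s.fg_size for s in stats),
        bg_with_term=sum(s.bg_with_term for s in stats),
        bg_size=sum(s.bg_size for s in stats),
    )


def probabilities(s):
    """Return (foreground probability, background probability)."""
    return s.fg_with_term / s.fg_size, s.bg_with_term / s.bg_size


# Small index that keeps stopwords: "the" is in 90% of docs everywhere.
small = IndexStats(fg_with_term=450, fg_size=500, bg_with_term=900, bg_size=1_000)
# Large index that strips stopwords: "the" never appears at all.
large = IndexStats(fg_with_term=0, fg_size=500, bg_with_term=0, bg_size=1_000_000)

merged = merge([small, large])

print("small index alone:", probabilities(small))
print("merged:", probabilities(merged))
```

Within the small index alone, foreground and background probabilities are identical, so "the" is not significant. After the merge, the foreground probability is far higher than the background one, so the stopword looks strongly correlated with the query.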
Another possible enhancement would be to get phrase significance, i.e.
multi-term significance instead of single-term significance.
I outline some of the possibilities in creating phrases from significant
terms, starting 51 mins into this recent video:
Revealing the Uncommonly Common with Elasticsearch | SkillsCast | 24th April 2014
the-uncommonly-common-with-elasticsearch
Cheers and Thanks for all the fish
You're welcome and thanks again for the feedback
Mark
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/OIorUFaI-KY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/25602f15-42ab-4857-9880-509d66a1a818%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.