I have millions of news articles.
They are clustered based on the similarity of their topic/content.
All news documents are clustered on a daily basis, so each cluster contains
news documents from the same day.
Each news document has a field named "newsid" that indicates which
news cluster it belongs to.
So basically I have one index, named "news", and one mapping named
What I want to do is retrieve news clusters with given text queries
and date range queries.
"Get news clusters that are related to 'Barack Obama' from 2012-04-01
to 2012-04-30. The clusters must have more than 5 documents as the
member of the cluster (namely, the cluster size should be bigger than
Note that I do not want the documents themselves. I want the clusters,
As far as I know, there are a couple of ways to achieve this goal.
I know that parent-child mapping can do this.
Faceting seems to be another option.
I heard that nested mapping also helps but I'm not sure about this.
Currently my "document" mapping has the following fields:
docid, date, title, content, newsid, count, url
"count" represents how many members the "newsid" cluster has.
I have additional mapping named "cluster" which is the parent mapping
"cluster" mapping has the following fields:
count, date, label
"count" field is the same as above. "label" field is the label of the
I currently run a "has_child" query to achieve my goal. It works fine,
but doesn't seem to be quick enough (it takes several seconds to retrieve
entries from the "cluster" mapping).
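To make the setup concrete, here is a minimal sketch of the kind of request body such a "has_child" search could use. It is not my exact query: the helper name build_has_child_query is made up for illustration, and I am assuming the text is matched against the "title" and "content" fields of the child "document" mapping, while the date range and the size threshold are applied to the "date" and "count" fields of the parent "cluster" mapping. Query DSL details may differ between Elasticsearch versions.

```python
import json


def build_has_child_query(text, start, end, min_size):
    """Build a search body to run against the parent "cluster" type.

    Hypothetical helper: field names come from the mappings described
    above; which fields carry the text match is an assumption.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Keep clusters that have at least one matching child document.
                    {
                        "has_child": {
                            "type": "document",
                            "query": {
                                "query_string": {
                                    "fields": ["title", "content"],
                                    # Wrap in double quotes for a phrase search;
                                    # text that itself contains quotes would
                                    # need escaping first.
                                    "query": '"%s"' % text,
                                }
                            },
                        }
                    },
                    # Restrict the parent clusters to the date range (inclusive).
                    {"range": {"date": {"gte": start, "lte": end}}},
                    # Keep only clusters with more than min_size members.
                    {"range": {"count": {"gt": min_size}}},
                ]
            }
        }
    }


body = build_has_child_query("Barack Obama", "2012-04-01", "2012-04-30", 5)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search on the "news" index, restricted to the "cluster" type.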
Is my configuration the most optimized one?
Should I be using faceted search instead of using two mappings? (the
"label" field of "cluster" mapping can be integrated into "document"
mapping so that won't be a problem)
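For comparison, the faceted alternative I have in mind would look roughly like the sketch below: search the "document" mapping directly and group the hits by "newsid" with a terms facet. The helper names, the facet name "clusters" and the sample response are hypothetical, and I am assuming the terms facet response lists its entries under "terms" with "term" and "count" keys. Newer Elasticsearch releases replace facets with aggregations, so this only applies to versions that still have facets.

```python
def build_facet_query(text, start, end, min_size, max_clusters=100):
    """Build a search body to run against the "document" type.

    Hypothetical helper: documents are filtered by text, date and the
    stored cluster size, then grouped by "newsid" with a terms facet.
    """
    return {
        "size": 0,  # only the facet is needed, not the documents themselves
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "fields": ["title", "content"],
                            "query": '"%s"' % text,
                        }
                    },
                    {"range": {"date": {"gte": start, "lte": end}}},
                    # "count" on a document is the size of its cluster.
                    {"range": {"count": {"gt": min_size}}},
                ]
            }
        },
        "facets": {
            "clusters": {"terms": {"field": "newsid", "size": max_clusters}}
        },
    }


def cluster_ids(response, facet_name="clusters"):
    """Return the newsid values listed in a terms facet response."""
    return [entry["term"] for entry in response["facets"][facet_name]["terms"]]


facet_body = build_facet_query("Barack Obama", "2012-04-01", "2012-04-30", 5)

# Hand-written response in the assumed facet shape, for illustration only.
sample_response = {
    "facets": {
        "clusters": {
            "_type": "terms",
            "terms": [
                {"term": "20120402-17", "count": 9},
                {"term": "20120411-03", "count": 6},
            ],
        }
    }
}
print(cluster_ids(sample_response))
```

Note that the facet "count" here is the number of matching documents per "newsid", not the cluster size, which is why the size threshold is applied through the stored "count" field instead.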
I would appreciate any assessment.
Thanks in advance.