Best way to index documents for retrieving clustered results


(mp2893) #1

Hi,

I have millions of news articles.
They are clustered based on the similarity of their topic/content.
All news documents are clustered on a daily basis, so each cluster
contains news documents from the same day.
Each news document has a field named "newsid" that identifies the
news cluster it belongs to.

So basically I have one index, named "news", and one mapping named
"document".
What I want to do is retrieve news clusters with given text queries
and date range queries.
For example,
"Get news clusters that are related to 'Barack Obama' from 2012-04-01
to 2012-04-30. The clusters must have more than 5 documents as the
member of the cluster (namely, the cluster size should be bigger than
5)"
Note that I do not want the documents themselves. I want the clusters,
namely "newsid".

As far as I know, there are a couple of ways to achieve this goal.
I know that parent-child mapping can do this.
Faceting seems to be another option.
I heard that nested mapping also helps but I'm not sure about this.

Currently my "document" mapping has the following fields:
docid, date, title, content, newsid, count, url
"count" represents how many members the "newsid" cluster has.

I have an additional mapping named "cluster", which is the parent
mapping of "document".
The "cluster" mapping has the following fields:
count, date, label
The "count" field is the same as above. The "label" field is the label
of the cluster.

I currently do a "has_child" query to achieve my goal. It works fine,
but it doesn't seem to be quick enough (it takes several seconds to
retrieve entries from the "cluster" mapping).
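
For reference, a request of the kind described here might be built as in the Python sketch below. The exact query is not shown in this thread, so everything in it is an assumption: the helper name has_child_cluster_query is hypothetical, the child query is taken to be a "text" query on "content", and the date range and size threshold are applied as filters on the parent's "date" and "count" fields.

```python
import json


def has_child_cluster_query(text, start, end, min_size):
    """Build a search body for the "cluster" type (hypothetical helper).

    Parents are matched through their "document" children via has_child;
    the parent's own "date" and "count" fields are filtered directly.
    """
    return {
        "query": {
            "filtered": {
                "query": {
                    "has_child": {
                        "type": "document",  # child mapping named in the post
                        "query": {"text": {"content": text}},
                    }
                },
                "filter": {
                    "and": [
                        {"range": {"date": {"gte": start, "lte": end}}},
                        # "more than N documents" -> strictly greater than
                        {"range": {"count": {"gt": min_size}}},
                    ]
                },
            }
        }
    }


if __name__ == "__main__":
    body = has_child_cluster_query("barack obama", "2012-04-01", "2012-04-30", 5)
    print(json.dumps(body, indent=2))
```

The printed JSON would be sent to the news/cluster/_search endpoint.
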
Is my configuration the best one for this?
Should I be using faceted search instead of two mappings? (The
"label" field of the "cluster" mapping can be merged into the
"document" mapping, so that won't be a problem.)
I would appreciate an assessment.
Thanks in advance.

Ed


(Clinton Gormley) #2

Hiya

So basically I have one index, named "news", and one mapping named
"document".
What I want to do is retrieve news clusters with given text queries
and date range queries.
For example,
"Get news clusters that are related to 'Barack Obama' from 2012-04-01
to 2012-04-30. The clusters must have more than 5 documents as the
member of the cluster (namely, the cluster size should be bigger than
5)"
Note that I do not want the documents themselves. I want the clusters,
namely "newsid".

It sounds like facets are your best bet, at least for the query you
describe above.

Try this:

curl -XGET 'http://127.0.0.1:9200/news/document/_search?pretty=1' -d '
{
   "query" : {
      "filtered" : {
         "filter" : {
            "range" : {
               "date" : {
                  "lte" : "2012-04-30",
                  "gte" : "2012-04-01"
               }
            }
         },
         "query" : {
            "text" : {
               "content" : "barack obama"
            }
         }
      }
   },
   "facets" : {
      "clusters" : {
         "terms" : {
            "field" : "newsid"
         }
      }
   },
   "size" : 0
}
'
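
Two things to keep in mind with a terms facet: it returns only the top 10 terms unless you set "size", and each count is the number of matching documents for that newsid, not the size of the cluster. Here is a rough Python sketch of how the request could be extended and the response read. It assumes the date field is called "date" and that "count" holds the cluster size, as in the field list from the question. The function names and the sample response values are made up for illustration.

```python
def build_facet_query(text, start, end, min_size=5, max_clusters=100):
    """Build a facet-only search body (hypothetical helper)."""
    return {
        "query": {
            "filtered": {
                "query": {"text": {"content": text}},
                "filter": {
                    "and": [
                        {"range": {"date": {"gte": start, "lte": end}}},
                        # keep only documents whose cluster has more than
                        # min_size members ("count" is stored per document)
                        {"range": {"count": {"gt": min_size}}},
                    ]
                },
            }
        },
        "facets": {
            "clusters": {
                # raise "size" above the default of 10 terms
                "terms": {"field": "newsid", "size": max_clusters}
            }
        },
        "size": 0,  # no hits wanted, only the facet
    }


def cluster_ids(response):
    """Return (newsid, matching_docs) pairs from a terms facet response."""
    entries = response["facets"]["clusters"]["terms"]
    return [(entry["term"], entry["count"]) for entry in entries]


if __name__ == "__main__":
    # Invented response in the shape a terms facet returns.
    sample = {
        "facets": {
            "clusters": {
                "_type": "terms",
                "missing": 0,
                "total": 12,
                "other": 0,
                "terms": [
                    {"term": "c-101", "count": 9},
                    {"term": "c-205", "count": 3},
                ],
            }
        }
    }
    for newsid, matching_docs in cluster_ids(sample):
        print(newsid, matching_docs)
```

Filtering on "count" in the request keeps the size check on the server, so the client only has to read the newsid values out of the facet.
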

clint


(mp2893) #3

Thanks for the reply Clinton!!
I'll try your query and post the results in this thread.

Ed


