What is the best way to find collocations of terms? Is it to index
documents with the shingle filter and then run a terms facet on that
field?
How can I get a list of statistically improbable phrases for a
document? From what I understand, unlike the list of highly frequent
phrases in the previous question, this list would also need to take the
IDF score into account. Is there any way to extract that from
Elasticsearch? This is roughly the reverse of what a query returns: a list
of phrases that score highly for a given document, rather than a list of
documents that score highly for a given phrase.
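To make the second question concrete, here is a minimal sketch of what "phrases that score highly for a given document" could mean, done outside Elasticsearch in plain Python. It assumes a small in-memory corpus, two-word shingles, and a smoothed tf * idf score; the function names shingles and improbable_phrases are made up for illustration and are not part of any Elasticsearch API.

```python
import math
import re
from collections import Counter


def shingles(text, size=2):
    """Return lowercased word n-grams (shingles) of the given size."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def improbable_phrases(doc, corpus, size=2, top_n=5):
    """Rank the phrases of `doc` by tf * idf, with idf taken from `corpus`."""
    n_docs = len(corpus)
    # Document frequency: how many corpus documents contain each phrase.
    df = Counter()
    for text in corpus:
        df.update(set(shingles(text, size)))
    # Term frequency of each phrase inside the target document.
    tf = Counter(shingles(doc, size))
    scores = {
        phrase: count * math.log((1 + n_docs) / (1 + df[phrase]))
        for phrase, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Phrases that appear in every document get a score of zero, while phrases that are frequent in the document but rare in the corpus rise to the top.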
Are you looking to do this via ES because it looks like the best/easiest
route, or for some other reason? A long time ago I ran a social bookmarking
service where I took the top N matches (with tags) and post-processed them.
It worked OK with the top 200 hits.
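As a rough illustration of that kind of post-processing, the sketch below takes the text of the top hits (already fetched by whatever means) and ranks adjacent word pairs by pointwise mutual information. PMI is my assumption here as one common collocation measure, not necessarily what that service did, and the function name and the min_count threshold are made up for the example.

```python
import math
import re
from collections import Counter


def collocations(hit_texts, min_count=2, top_n=10):
    """Rank adjacent word pairs in the hit texts by pointwise mutual information."""
    unigrams = Counter()
    bigrams = Counter()
    for text in hit_texts:
        words = re.findall(r"\w+", text.lower())
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_words = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scored = []
    for (first, second), count in bigrams.items():
        # Skip rare pairs: PMI is unreliable on very small counts.
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_words
        p_second = unigrams[second] / total_words
        scored.append((f"{first} {second}", math.log(p_pair / (p_first * p_second))))
    # Highest PMI first; ties are broken alphabetically.
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

With a few hundred hits this runs entirely client-side, so nothing has to be shingled in the index.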
I wouldn't shingle at index time. You'll shingle all your content and the
index will balloon.
it's a Java lib you can use for getting Collocations and SIPs (and their
hybrids). Unrelated to (Elastic)Search, but people sometimes use it in
document processing pipelines to tag documents before indexing them (so you
don't have to shingle everything in a doc, just add key terms/phrases as
"tags"). You could then facet on that, or index in a separate highly
weighted field, detect trending topics in a stream of data, etc.
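The tagging step in such a pipeline might look roughly like the sketch below. The extractor is a hypothetical stand-in (most frequent two-word phrases), not the Java library mentioned above, and the field names "body" and "tags" are assumptions.

```python
import re
from collections import Counter


def extract_key_phrases(text, top_n=3):
    """Hypothetical stand-in extractor: the most frequent two-word phrases."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_n]]


def tag_document(doc, text_field="body", tags_field="tags"):
    """Return a copy of `doc` with key phrases stored under `tags_field`."""
    tagged = dict(doc)
    tagged[tags_field] = extract_key_phrases(doc[text_field])
    return tagged
```

The tagged copy is what would get indexed, so only the short tags field needs to be faceted on instead of a fully shingled body.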
On Friday, July 5, 2013 9:31:42 PM UTC-4, Mike wrote:
[original question, quoted in full above]