Ramon,
I have some experience in this sort of thing.
(Note: I build an open source document analysis platform <https://github.com/IKANOW/Infinit.e> built
on Elasticsearch and including support for various entity extractors. It
is probably too heavyweight for the purposes you describe, though some of
the code may prove useful.)
There are a few options, depending on the sort of article you are indexing:
1] You can very cheaply run the text through the OpenNLP "POS tagger" <http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.tagging.cmdline>.
This does an OKish job of picking out proper nouns and other things you
might want to run NLP over. You can then stick them in a multi field and
use that with "mlt". I'm not sure which language your news articles are in, but you
can find POS models for many languages.
2] A slightly more complex but better solution would be to use TextRank <https://github.com/turian/textrank>.
This is built on top of OpenNLP POS, but then uses a snazzy bit of maths to
pick out significant keyphrases. The trick would then be to tokenize the
keyphrases back into keywords and put them in the mlt array.
3] For mainstream news articles written in English/French/Spanish, OpenCalais
<http://www.opencalais.com/> is a very good free SaaS named entity
extractor, with a decent daily call allowance. (Disclaimer: I haven't
looked at its performance on languages other than English)
3a] (There are also many commercial alternatives of comparable quality,
some of which have low-volume free tiers, e.g. we have used AlchemyAPI <http://www.alchemyapi.com/>.)
3b] (For slightly less mainstream news articles, you will find that SaaS
offerings like OpenCalais and AlchemyAPI tend to be over-aggressive at
resolving names to the names of famous people - we had to build in some
post-processing to "unresolve" names.)
(Finally, for geo-tagging, give Clavin <https://github.com/Berico-Technologies/CLAVIN> a look.
We haven't integrated it into "Infinit.e" yet, but it looks pretty good.)
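To make options 1] and 2] a bit more concrete, here is a rough Python sketch of the last step: tokenizing extracted keyphrases back into keywords and building an "mlt" query body over the field that holds them. It is only an illustration under assumptions: the field name "entities" and both function names are made up for this example, the keyphrases are assumed to come from whichever extractor you pick, and the free-text parameter is written as "like" (older Elasticsearch releases called it "like_text", so check the docs for your version).

```python
import json
import re


def keyphrases_to_keywords(keyphrases):
    """Split keyphrases into lowercase keywords, de-duplicated in first-seen order.

    Hypothetical helper: the keyphrases would come from your extractor
    (POS tagger, TextRank, OpenCalais, ...).
    """
    seen = set()
    keywords = []
    for phrase in keyphrases:
        for token in re.findall(r"\w+", phrase.lower()):
            if token not in seen:
                seen.add(token)
                keywords.append(token)
    return keywords


def build_mlt_query(keywords, field="entities"):
    """Build a more_like_this query body restricted to the keyword field.

    Assumptions: "entities" is an illustrative field name, and "like" is the
    free-text parameter (older releases used "like_text").
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": " ".join(keywords),
                # Entity fields are short, so do not require repeated terms.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


if __name__ == "__main__":
    phrases = ["Barack Obama", "White House", "obama administration"]
    body = build_mlt_query(keyphrases_to_keywords(phrases))
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request. Restricting "fields" to the entity field is what makes proper nouns dominate the similarity, rather than every term in the article text.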
I will also say that from experience, you will still probably find the
results of the "mlt" query disappointing (I'd love to hear about it if you
don't!) - the "standard" way of clustering documents involves using Mahout <https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html>,
which integrates well with Lucene but not so much with Elasticsearch.
Hope this helps!
Alex
www.ikanow.com
On Sunday, February 3, 2013 8:26:43 PM UTC-5, racedo wrote:
Hi all,
I'm indexing news articles to basically relate them within a cluster of
documents with "more like this". In this case, the results could be heavily
improved if there was a way to give more weight to terms that are proper
nouns / named entities. Is there any way to do this with elasticsearch?
Many thanks in advance.
Ramon