I have some experience in this sort of thing.
(note: I build an open source document analysis platform, https://github.com/IKANOW/Infinit.e, built
on Elasticsearch and including support for various entity extractors. It
is probably too heavyweight for the purposes you describe, though some of
the code may prove useful.)
There are a few options, depending on the sort of article you are indexing:
1] You can very cheaply run the text through the OpenNLP "POS tagger"
(http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.tagging.cmdline).
This does an OKish job of picking out proper nouns and other things you
might want to run NLP over. You can then stick them in a multi field and
use that with "mlt". I'm not sure which language your news articles are in,
but you can find POS models for many languages.
2] A slightly more complex but better solution would be to use TextRank
(https://github.com/turian/textrank).
This is built on top of OpenNLP POS, but then uses a snazzy bit of maths to
pick out significant keyphrases. The trick would then be to tokenize the
keyphrases back into keywords and put them in the mlt array.
3] For mainstream news articles written in English/French/Spanish, OpenCalais
(http://www.opencalais.com/) is a very good free SaaS named entity
extractor, with a decent daily call allowance. (Disclaimer: I haven't
looked at its performance on languages other than English.)
3a] (There are also many commercial alternatives of comparable quality,
some of which have low volume free tiers, e.g. we have used AlchemyAPI, http://www.alchemyapi.com/)
3b] (For slightly less mainstream news articles you will find that the SaaS
offerings like OpenCalais and AlchemyAPI tend to be over-aggressive at
resolving names to the names of famous people. We had to build in some
post-processing to "unresolve" names.)
(Finally, for geo-tagging, give Clavin (https://github.com/Berico-Technologies/CLAVIN) a look. We haven't integrated it into "Infinit.e" yet, but it looks pretty
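To make options 1] and 2] a bit more concrete, here is a minimal Python sketch of the two glue steps: splitting extracted keyphrases back into keywords, and building a query that runs "mlt" over both the article text and a separate entity field, with the entity clause boosted. The field names "body" and "entities", the helper names and the boost value are all assumptions for illustration, not anything from Infinit.e. It also uses the newer "more_like_this" / "like" spelling of the query; older Elasticsearch releases use "mlt" / "like_text" instead, so check the docs for your version.

```python
import json
import re


def keyphrases_to_keywords(keyphrases):
    """Split keyphrases into unique lowercase keywords, keeping first-seen order."""
    seen = set()
    keywords = []
    for phrase in keyphrases:
        for token in re.findall(r"\w+", phrase.lower()):
            if token not in seen:
                seen.add(token)
                keywords.append(token)
    return keywords


def build_mlt_query(text, keywords, entity_boost=3.0):
    """Build a bool query: one MLT clause on the text, one boosted MLT clause on entities.

    "body" and "entities" are hypothetical field names; entity_boost is an
    arbitrary starting value that would need tuning.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "more_like_this": {
                            "fields": ["body"],
                            "like": text,
                            "min_term_freq": 1,
                        }
                    },
                    {
                        "more_like_this": {
                            "fields": ["entities"],
                            "like": " ".join(keywords),
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                            "boost": entity_boost,
                        }
                    },
                ]
            }
        }
    }


if __name__ == "__main__":
    # Keyphrases as they might come out of a POS tagger or TextRank run.
    kws = keyphrases_to_keywords(["New York Times", "York City"])
    print(json.dumps(build_mlt_query("full article text here", kws), indent=2))
```

The same keyword list would be stored in the "entities" field of each article at index time, so that the boosted clause has something to match against.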
I will also say that, from experience, you will still probably find the
results of the "mlt" query disappointing (I'd love to hear about it if you
don't!). The "standard" way of clustering documents involves using Mahout
(https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html),
which integrates well with Lucene but not so much with Elasticsearch.
Hope this helps!
On Sunday, February 3, 2013 8:26:43 PM UTC-5, racedo wrote:
I'm indexing news articles to basically relate them within a cluster of
documents with "more like this". In this case, the results could be heavily
improved if there was a way to give more weight to terms that are proper
nouns / named entities. Is there any way to do this with elasticsearch?
Many thanks in advance.