Use Proper Nouns / Named Entities to improve MLT results?

Hi all,

I'm indexing news articles to basically relate them within a cluster of
documents with "more like this". In this case, the results could be heavily
improved if there was a way to give more weight to terms that are proper
nouns / named entities. Is there any way to do this with elasticsearch?

Many thanks in advance.

Ramon

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The detection of proper nouns / named entities is currently outside of
the scope of Elasticsearch.

It is possible to integrate indexing programs with text mining
algorithms, such as Stanford NER, Apache OpenNLP NER, or LingPipe, and
create fields for the recognized entities, and a related JSON object /
array that contains the "more like this" synonyms for a field.
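
A minimal sketch of how the separate-entity-field idea could be used to weight named entities more heavily at query time. It only builds the JSON bodies and does not call any client. The field names `body` and `entities`, the helper names, and the boost value are assumptions made for illustration, and `like` is the parameter name in recent Elasticsearch releases (older releases used `like_text`), so adjust to your version.

```python
import json


def build_document(title, body, entities):
    """Document with the recognized entities stored in their own field.

    `entities` is assumed to come from an external NER step
    (Stanford NER, OpenNLP, LingPipe, ...) run before indexing.
    """
    return {"title": title, "body": body, "entities": entities}


def build_entity_mlt_query(text, entities, entity_boost=3.0):
    """Combine two more_like_this clauses in a bool query.

    The clause on the (hypothetical) `entities` field carries a higher
    boost, so matches on named entities count more than matches on
    ordinary body terms.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "more_like_this": {
                            "fields": ["body"],
                            "like": text,
                            "min_term_freq": 1,
                        }
                    },
                    {
                        "more_like_this": {
                            "fields": ["entities"],
                            "like": " ".join(entities),
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                            "boost": entity_boost,
                        }
                    },
                ]
            }
        }
    }


if __name__ == "__main__":
    query = build_entity_mlt_query(
        "Obama visits Berlin", ["Barack Obama", "Berlin"]
    )
    print(json.dumps(query, indent=2))
```

The boost is something to tune against your own articles; 3.0 is only a placeholder.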

Because the NER task is heavy and time consuming, I would not recommend
running it on the same machine as an ES data node. So, a plugin would be
possible, but only on the TransportClient side.

Jörg

Ramon,

I have some experience in this sort of thing.

(note: I build an open source document analysis platform
<https://github.com/IKANOW/Infinit.e> built on Elasticsearch and including
various support for entity extractors, which is probably too heavyweight
for the purposes you describe, though some of the code may prove useful)

There are a few options, depending on the sort of article you are indexing:
1] You can very cheaply run the text through the OpenNLP "POS tagger"
<http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.tagging.cmdline> ...
this does an OKish job of picking out proper nouns and other things you
might want to run NLP over. You can then stick them in a multi field and
use that with "mlt". Not sure which language your news articles are in, you
can find POS models for many languages though.
2] A slightly more complex but better solution would be to use TextRank
<https://github.com/turian/textrank>.
This is built on top of OpenNLP POS, but then uses a snazzy bit of maths to
pick out significant keyphrases. The trick would then be to tokenize the
keyphrases back into keywords and put them in the mlt array.
3] For mainstream news articles written in English/French/Spanish, OpenCalais
<http://www.opencalais.com/> is a very good free SaaS named entity
extractor, with a decent daily call allowance. (Disclaimer: I haven't
looked at its performance on languages other than English)
3a] (There are also many commercial alternatives of comparable quality,
some of which have low volume free tiers, eg we have used AlchemyAPI
<http://www.alchemyapi.com/>)
3b] (For slightly less mainstream news articles you will find that the SaaS
offerings like OpenCalais and AlchemyAPI will tend to be over-aggressive at
resolving names to names of famous people - we had to build in some post
processing to "unresolve" names)
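
As a rough sketch of options 1] and 2], assuming the OpenNLP command-line POS tagger output format of space-separated `token_TAG` pairs and an English model using Penn Treebank tags (`NNP`, `NNPS` for proper nouns): the function names below are made up for illustration. The first function groups consecutive proper nouns into candidate names; the second splits keyphrases back into single keywords for an array field.

```python
# Penn Treebank tags for singular and plural proper nouns (assumed tag set).
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def proper_noun_phrases(tagged_line):
    """Extract runs of consecutive proper nouns from one tagged sentence.

    `tagged_line` is expected to look like "Pierre_NNP Vinken_NNP ,_, 61_CD".
    """
    phrases, current = [], []
    for item in tagged_line.split():
        # Split on the last underscore so tokens containing "_" survive.
        token, _, tag = item.rpartition("_")
        if token and tag in PROPER_NOUN_TAGS:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def keyphrases_to_keywords(phrases):
    """Split phrases into lowercase keywords, dropping repeats, keeping order."""
    seen, keywords = set(), []
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in seen:
                seen.add(word)
                keywords.append(word)
    return keywords


if __name__ == "__main__":
    tagged = ("Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, "
              "will_MD join_VB the_DT board_NN Nov._NNP 29_CD ._.")
    names = proper_noun_phrases(tagged)
    print(names)
    print(keyphrases_to_keywords(names))
```

The resulting list can be indexed as its own field next to the article text and then targeted by "mlt".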

(Finally, for geo-tagging, give Clavin
<https://github.com/Berico-Technologies/CLAVIN> a look. We haven't
integrated it into "Infinit.e" yet, but it looks pretty good.)

I will also say that from experience, you will still probably find the
results of the "mlt" query disappointing (I'd love to hear about it if you
don't!) - the "standard" way of clustering documents involves using Mahout
<https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html>,
which integrates well with Lucene but not so much Elasticsearch.

Hope this helps!

Alex
www.ikanow.com

Alex, this is excellent feedback, much appreciated. I'd like to keep using
elasticsearch for simplicity and, as Joerg suggested, a plug-in might
come in handy too. Thanks to Joerg as well for his suggestions.

--
Ramon
