Entity detection in text

I have a solution in mind for a problem of mine, and I would like some
feedback before I embark on experimenting with it.

My situation is as follows:

I have a large data set of named locations, each of which may have several
alternative names, synonyms, etc., for things like venues, neighborhoods,
streets, POIs, and so on. The data set includes tens of millions of names.
Elasticsearch is great for indexing this type of data and I already use ES
on this data for other use cases.

My problem is identifying relevant references to those locations in text.
So, given e.g. a newspaper article that mentions several locations by
name, I'd like to come up with a short list of possible matches in my
location index. It doesn't have to be very exact and it doesn't have to be
very fast. False positives are OK (and inevitable), but false negatives
would be a concern. If I can get a reasonably short list of locations, I
can post-process the results to get rid of the false positives.

The normal approach to this kind of problem is to use NLP with some
machine learning model that picks apart the text and comes up with a list
of locations. My main issue with this approach is that it does not detect
all references (false negatives), and it doesn't really benefit from all
the location metadata that I have indexed. I've played around with a few
NLP products and libraries and am so far not overly impressed with the
results. It's a very hard problem and there is just a lot of ambiguity in
text.

Looking at the recent suggester and finite state machine work in Lucene,
which makes it possible to fire thousands of completion queries per second
at my huge index, it occurred to me that I might just generate thousands of
queries from the article, brute-force my way through the problem, and get
a reasonably short list of possible locations for any article. If each
query takes e.g. 2 ms and I need to run 20,000 of them to figure out
whether combinations of words in the text match any location names, that
is feasible in well under a minute, and with ES I can just throw more
hardware at the problem if I need it to be faster. If I could get that
down to a few seconds, that would be perfectly acceptable. A reasonably
short list of possible matches would allow me to use other techniques in a
post-processing step to figure out the best ones.
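
To make the brute-force idea a bit more concrete, here is a minimal Python
sketch of what I have in mind, hitting the plain search API with match
queries. The index name `locations` and field `name` are just assumptions
for the example; the completion suggester would be a drop-in alternative:

    import requests  # plain HTTP against the ES REST API

    ES_URL = "http://localhost:9200/locations/_search"  # hypothetical index

    def candidate_phrases(text, max_len=4):
        """Generate all word n-grams up to max_len words as candidate names."""
        words = text.split()
        for size in range(1, max_len + 1):
            for i in range(len(words) - size + 1):
                yield " ".join(words[i:i + size])

    def lookup(phrase):
        """Fire one small match query per phrase; return IDs of any hits."""
        body = {"query": {"match": {"name": phrase}}, "size": 3}
        hits = requests.get(ES_URL, json=body).json()["hits"]["hits"]
        return [h["_id"] for h in hits]

    def shortlist(article_text):
        """Brute force: thousands of tiny queries, one short candidate list."""
        candidates = set()
        for phrase in candidate_phrases(article_text):
            candidates.update(lookup(phrase))
        return candidates

Batching the phrases into _msearch (or simply running them concurrently)
would be the obvious next step to get the total time down to a few seconds.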

Does this sound like a good plan? Are there other features in ES that I
might use? Or am I being naive?

Jilles

Authority files for library catalogs, or freebase.com, are valuable sources
for named entity recognition (NER), besides corpus-based services like
OpenCalais. My approach to making library catalogs a helpful tool for the
public, with the help of authority files (GND or VIAF - viaf.org), is as
follows:

  • each entity in VIAF has a unique ID (e.g. a URI in Linked Data), and a
    bundle of name variants is registered under that ID. VIAF is also
    multilingual. ES can index the authority data by the unique ID (a small
    indexing sketch follows this list).

  • as an alternative, by using FSA/FST, the authority data can be prepared
    for recognition of names. FSTs are among the fastest known structures for
    this kind of lookup and are also used in string pattern matching
    algorithms. With an FST, a Lucene/ES token filter can be implemented that
    attaches entity information when indexing unstructured data with unknown
    entities.

  • if the entity information attached in the index is the ID, the app layer
    can decide how to access more authority data (the unique ID may also be
    indexed in ES, or may represent a URL that points to modifiable
    information about the entity)
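
For the first point, a minimal Python sketch of what indexing by unique ID
could look like, with a made-up record and an assumed index called
`authority`; the mapping and field names are only illustrative:

    import requests

    ES = "http://localhost:9200"

    # Hypothetical authority record: one unique ID plus its bundle of variants.
    record = {
        "uri": "http://viaf.org/viaf/0000000",   # placeholder VIAF-style URI
        "variants": ["Example Name", "Beispielname", "Nom d'exemple"],
        "type": "person",
    }

    # Index the record under its unique ID, so a search on any variant
    # comes back with the ID that the app layer can resolve further.
    requests.put(ES + "/authority/record/1", json=record)

    # Lookup by variant returns the document carrying the unique ID.
    query = {"query": {"match": {"variants": "Beispielname"}}}
    hit = requests.get(ES + "/authority/_search", json=query).json()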

With my baseform analysis plugin, I have prepared a stripped-down FSA
implementation based on the one in Lucene's morfologik analyzer. The
advantage of the Lucene FSA is its compact implementation for creating a
lexicon-based token filter. The disadvantage of this implementation is that
the input for the FSA must be sorted and the FSA can't be modified after
creation. I also have other FSA/FST automaton implementations which do not
need sorted input and can grow dynamically, but they use more memory.

If freebase.com can be prepared as (a bunch of) FSAs, it would be possible
to write a naive FSA-based NER plugin for ES. Why naive? The promise of NLP
is that it can recognize more features in a text than an FSA can. With POS
tagging and sentence boundary detection, as OpenNLP, UIMA, or Stanford NLP
provide, it is possible to disambiguate the meaning of words. Another
problem is multiple languages in a single text. That is hard even for the
best NLP implementations out there. With my langdetect plugin, a list of
languages can be detected in ES fields, and this may help further NLP-based
processing.
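
To show what "naive" means in practice, here is a small Python sketch of
the kind of matching such a token filter would do, with a plain dict
standing in for the FSA and made-up entity IDs; in the real plugin this
would of course be a Lucene automaton inside a TokenFilter:

    # Toy lexicon: token sequence -> entity ID (stand-in for the FSA).
    # Entries and IDs are placeholders, not real authority data.
    LEXICON = {
        ("berlin",): "entity:1",
        ("kreuzberg",): "entity:2",
        ("johann", "wolfgang", "von", "goethe"): "entity:3",
    }
    MAX_LEN = max(len(key) for key in LEXICON)

    def annotate(tokens):
        """Greedy longest-match scan; emit (position, entity ID) pairs."""
        tokens = [t.lower() for t in tokens]
        i, found = 0, []
        while i < len(tokens):
            for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                entity = LEXICON.get(tuple(tokens[i:i + size]))
                if entity:
                    found.append((i, entity))
                    i += size
                    break
            else:  # no lexicon entry starts at this token
                i += 1
        return found

    print(annotate("I walked through Kreuzberg in Berlin".split()))
    # [(3, 'entity:2'), (5, 'entity:1')]

The scan happily tags any word that happens to be in the lexicon,
regardless of context; that is exactly the ambiguity that POS tagging and
sentence-level context are meant to resolve.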

Jörg

Thanks for your detailed response. Glad to see some real-world validation
of this idea.

The nice thing about locations is that they are less ambiguous in the
context of other locations (e.g. streets that are part of a neighborhood,
which is part of a city in a country). My problem with NLP-based approaches
is that they can't easily be tweaked to take that into account. A second
reason I like the FST-based approach is that I can use the rich data I have
on translations, synonyms, etc. Your langdetect plugin looks interesting as
well.
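
A rough Python sketch of what I mean, assuming each candidate from the
shortlist carries its ancestor chain from my hierarchical data set (all
names here are made up for the example):

    # Hypothetical shortlist: candidate -> ancestor chain from the hierarchy.
    candidates = {
        "kreuzberg (berlin)": ["berlin", "germany"],
        "kreuzberg (rhineland)": ["rhineland-palatinate", "germany"],
        "berlin": ["germany"],
    }

    def prefer_consistent(candidates):
        """Keep candidates that are linked to another candidate through the
        hierarchy; fall back to the full list if nothing is supported."""
        keep = set()
        for name, ancestors in candidates.items():
            present = [a for a in ancestors if a in candidates]
            if present:
                keep.add(name)        # this candidate's parent was also found
                keep.update(present)  # ...and the parent itself is kept too
        return keep or set(candidates)

    print(prefer_consistent(candidates))
    # {'kreuzberg (berlin)', 'berlin'} -- the Rhineland Kreuzberg drops out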

I have no fixed plan yet for what to do with this, but I might get around
to experimenting with it a little further.

Jilles

Forgot to mention Geonames for locations. You are absolutely right that
NLP/FST does not solve the problem of how to obtain a logical structure of
geospatial information from the place names given in a text.

All names have a unique ID, and from the example
http://sws.geonames.org/3020251/about.rdf you can see how a geo entity is
linked by relations to other geo entities.

The idea is to index
http://datahub.io/de/dataset/geonames-semantic-web into ES (for
reference, or maybe for percolation?) and also build an FST-based
token filter to catch the variants of the geo names. The token filter may
produce a list of geo entity IDs if more than one possible location is
found. If more context can be given, like country, language, or distance,
the token filter may even weight the results toward the most probable
geo entity.
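
A small Python sketch of that weighting step, assuming the token filter (or
a post-processing step) hands back candidate IDs together with country and
language metadata; all values below are placeholders:

    # Hypothetical candidates for one surface form, with context metadata.
    candidates = [
        {"id": "geo:placeholder-1", "country": "DE", "langs": ["de", "en"]},
        {"id": "geo:placeholder-2", "country": "AT", "langs": ["de"]},
    ]

    def weigh(candidates, article_country=None, article_langs=()):
        """Rank candidates by how well they match the article's context."""
        def score(c):
            s = 0.0
            if article_country and c["country"] == article_country:
                s += 1.0  # same country as the article
            s += 0.5 * len(set(article_langs) & set(c["langs"]))
            return s
        return sorted(candidates, key=score, reverse=True)

    most_probable = weigh(candidates, article_country="DE",
                          article_langs=["de"])[0]

Distance to other locations already resolved in the same text could be
folded into the same score once a few candidates are fixed.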

Jörg

Also, I would ask you to look into Clavin -
http://clavin.bericotechnologies.com/
It's an NLP module just for detecting locations.

Thanks
Vineeth

Hm, with Clavin, I have not had much success recognizing places given in
German or as variant names. But maybe it's just me.

Jörg

I'll take a look at Clavin as well; it looks like a really interesting
approach. It seems to perform reasonably well on a few news articles I
tried. It looks like it's Geonames-only currently, but I might be able to
feed it some more data.

I've actually been trying to integrate Geonames, OpenStreetMap, GeoPlanet,
and a few other datasets. Not the easiest problem to solve either, but it
has gotten me a quite nice data set that organizes locations
hierarchically. That's what I'm looking to use. Multilingual support is a
huge issue here as well.

The main challenge is at the bottom of the graph, not the top. The more
specific the location, the better. Country is not enough. City is nice to
have; neighborhood is better, street is great, venue/POI would be best.
I've played with the Alchemy demo a bit, which is fairly good at extracting
a few general things like cities but misses all the specific references to
actual places in the city. I also tried some of the standard models that
come with OpenNLP and found that in most cases they miss really obvious
things. I managed to make it identify Berlin as a possible place reference
in a simple sentence, but not Kreuzberg in exactly the same sentence, or
Berlinn (misspelled).

Jilles
