Entity detection in text

jprante · October 26, 2013, 12:32pm

Authority files for library catalogs, or freebase.com, are valuable sources
for named entity recognition (NER) besides corpus data, like OpenCalais. My
approach to making library catalogs a helpful tool for the public with the
help of authority files (GND or VIAF, viaf.org) is as follows:

- Each entity in VIAF has a unique ID (e.g. a URI in Linked Data). Under
  each unique ID, a bundle of name variants is registered. VIAF is also
  multilingual. ES can index authority data by the unique ID.
- As an alternative, the authority data can be prepared for name
  recognition using FSA/FST. FST is the fastest known method, also used in
  string pattern matching algorithms. With an FST, a Lucene/ES token filter
  can be implemented to attach entity information when indexing
  unstructured data with unknown entities.
- If the entity information attached in the index is the ID, the app layer
  can decide how to access more authority data (the unique ID may also be
  indexed in ES, or may represent a URL that points to modifiable
  information about the entity).
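
To make the idea concrete, here is a minimal sketch in Python rather than as a real Lucene/ES token filter. It assumes a plain token trie as a simple stand-in for an FSA/FST, and the authority records and their IDs ("ex:0001" etc.) are made-up placeholders, not real VIAF or GND identifiers. It only shows the principle: name variants lead to a unique ID, and the longest matching variant in the text gets that ID attached.

```python
import re

# Hypothetical authority records: unique ID -> name variants.
# The IDs are invented placeholders, not real VIAF/GND identifiers.
AUTHORITY = {
    "ex:0001": ["Johann Wolfgang von Goethe", "J. W. von Goethe", "Goethe"],
    "ex:0002": ["Thomas Mann"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_trie(authority):
    """Build a token trie; the key None marks the end of a variant and holds its ID."""
    root = {}
    for entity_id, variants in authority.items():
        for name in variants:
            node = root
            for token in tokenize(name):
                node = node.setdefault(token, {})
            node[None] = entity_id
    return root


def annotate(text, trie):
    """Return (start, end, entity_id) token spans, preferring the longest match."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                best = (i, j, node[None])  # remember the longest match so far
        if best:
            hits.append(best)
            i = best[1]  # continue after the matched span
        else:
            i += 1
    return hits


trie = build_trie(AUTHORITY)
print(annotate("Thomas Mann admired Goethe.", trie))
```

Only the ID is attached to each span, so the app layer stays free to decide where to fetch further authority data for that ID.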

With my baseform analysis plugin, I have prepared a stripped-down version
of the FSA implementation from the Lucene morfologik analyzer. The
advantage of the Lucene FSA is its compact implementation for creating a
lexicon-based token filter. The disadvantage is that the input for the FSA
must be sorted and the FSA can't be modified after creation. I also have
other FSA/FST implementations which do not need sorted input and can grow
dynamically, but use more memory.
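
The sorted-input and build-once constraints are typical of incremental construction of a minimal automaton. The following is a minimal Python sketch of that general technique, not the morfologik or Lucene code; the class and function names (SortedFSABuilder, accepts, count_states) are made up for illustration. Because words arrive in sorted order, everything below the common prefix with the previous word is known to be final and can be merged with equivalent states right away, which is where the compactness comes from.

```python
class State:
    def __init__(self):
        self.edges = {}      # character -> State
        self.final = False

    def signature(self):
        # Two states are equivalent if they agree on finality and on outgoing
        # edges; children are already canonical when this is called.
        return (self.final, tuple(sorted((c, id(s)) for c, s in self.edges.items())))


class SortedFSABuilder:
    """Builds a minimal acyclic automaton from words added in sorted order."""

    def __init__(self):
        self.root = State()
        self.register = {}   # signature -> canonical state
        self.unchecked = []  # (parent, char, child) path of the previous word
        self.previous = None
        self.frozen = False

    def add(self, word):
        if self.frozen:
            raise RuntimeError("automaton cannot be modified after creation")
        if self.previous is not None and word <= self.previous:
            raise ValueError("input must be sorted and free of duplicates")
        common = 0
        for a, b in zip(word, self.previous or ""):
            if a != b:
                break
            common += 1
        self._minimize(common)  # suffix of the previous word is now final
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for ch in word[common:]:
            nxt = State()
            node.edges[ch] = nxt
            self.unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        self.previous = word

    def finish(self):
        self._minimize(0)
        self.frozen = True
        return self.root

    def _minimize(self, down_to):
        while len(self.unchecked) > down_to:
            parent, ch, child = self.unchecked.pop()
            sig = child.signature()
            if sig in self.register:
                parent.edges[ch] = self.register[sig]  # reuse equivalent state
            else:
                self.register[sig] = child


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)


builder = SortedFSABuilder()
for w in ["tap", "taps", "top", "tops"]:
    builder.add(w)
fsa = builder.finish()
print(accepts(fsa, "tops"), count_states(fsa))
```

A plain trie for the same four words needs eight states; the shared "p", "ps" suffix brings the minimized automaton down to five. A structure that accepts unsorted input and later additions cannot merge suffixes this eagerly, which is one way to see the memory trade-off.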

If freebase.com can be prepared as (a bunch of) FSAs, it would be possible
to write a naive FSA-based NER plugin for ES. Why naive? The magic of NLP
is that it promises to recognize more features in a text than an FSA can.
With POS tagging and sentence boundary detection, as OpenNLP, UIMA, or
Stanford NLP provide, it is possible to resolve ambiguities in the
meaning of words. Another problem is the use of multiple languages in a
single text. This problem is hard, even for the best NLP implementations
out there. With my langdetect plugin, a list of languages can be detected
in ES fields, and this may help further NLP-based processing.

Jörg

