Entity detection in text

Authority files for library catalogs, or freebase.com, are valuable sources
for named entity recognition (NER) alongside corpus-based sources like
OpenCalais. My approach to making library catalogs a helpful tool for the
public with the help of authority files (GND or VIAF, viaf.org) is as
follows:

  • each entity in VIAF has a unique ID (e.g. a URI in Linked Data). A
    bundle of name variants is registered under each unique ID, and VIAF is
    also multilingual. ES can index the authority data by unique ID.

  • as an alternative, the authority data can be compiled into an FSA/FST
    for recognizing names. FSTs are among the fastest known lookup methods
    and are also used in string pattern matching algorithms. With an FST, a
    Lucene/ES token filter can be implemented that attaches entity
    information while indexing unstructured data with unknown entities.

  • if the entity information attached in the index is the ID, the app layer
    can decide how to fetch more authority data (the unique ID may also be
    indexed in ES, or may be a URL that points to modifiable information
    about the entity).
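
To make the idea above concrete, here is a minimal sketch in Python. It is
not the plugin code: it uses a plain token trie as a stand-in for a real
FST, the IDs ("ex:person/1" etc.) are made up and are not real VIAF IDs,
and tokenize(), build_trie() and tag() are hypothetical helper names. It
only shows how name variants bundled under one ID can be matched in
unstructured text so that the ID is attached to the matching token span.

```python
from typing import Dict, List, Tuple

# Hypothetical authority records: unique ID -> bundle of name variants.
AUTHORITY = {
    "ex:person/1": ["Johann Wolfgang von Goethe", "Goethe", "J. W. von Goethe"],
    "ex:person/2": ["Friedrich Schiller", "Schiller"],
}

END = ""  # marker key for "a name ends here"; real tokens are never empty
PUNCT = ".,;:!?()\"'"


def tokenize(text: str) -> List[str]:
    """Crude tokenizer: whitespace split, strip punctuation, lowercase."""
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]


def build_trie(authority: Dict[str, List[str]]) -> dict:
    """Build a token-level trie whose accepting nodes carry the entity ID."""
    root: dict = {}
    for entity_id, variants in authority.items():
        for name in variants:
            node = root
            for tok in tokenize(name):
                node = node.setdefault(tok, {})
            node[END] = entity_id
    return root


def tag(text: str, trie: dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity_id) token spans, preferring longest matches."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (i, j, node[END])  # remember the longest match so far
        if best:
            hits.append(best)
            i = best[1]  # continue after the matched span
        else:
            i += 1
    return hits


trie = build_trie(AUTHORITY)
hits = tag("Schiller wrote letters to Johann Wolfgang von Goethe.", trie)
```

In a real token filter, the (start, end, ID) spans would be attached to the
token stream at index time, and the app layer would resolve the ID later.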

For my baseform analysis plugin, I have prepared a stripped-down version of
the FSA implementation used in the Lucene morfologik analyzer. The
advantage of the Lucene FSA is its compactness, which suits a lexicon-based
token filter. The disadvantage is that the input for the FSA must be sorted
and the FSA can't be modified after creation. I also have other FSA/FST
implementations that do not need sorted input and can grow dynamically,
but they use more memory.
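
The trade-off can be illustrated with a small Python sketch of the
well-known incremental construction of a minimal acyclic automaton from
sorted input. This is not the morfologik code, and the class and method
names (State, SortedFSABuilder, count_states) are made up for this
example. It shows why sorted input is required (only the suffix of the
previous word is still open for merging) and why the result is compact
(equivalent suffix states are shared).

```python
class State:
    def __init__(self):
        self.edges = {}     # char -> State
        self.final = False

    def signature(self):
        # Two states are equivalent if they agree on finality and on their
        # outgoing edges. Children are already canonical when this is called.
        return (self.final,
                tuple(sorted((ch, id(t)) for ch, t in self.edges.items())))


class SortedFSABuilder:
    def __init__(self):
        self.root = State()
        self._register = {}    # signature -> canonical state
        self._unchecked = []   # (parent, char, child) along the previous word
        self._previous = ""
        self._frozen = False

    def add(self, word):
        if self._frozen:
            raise RuntimeError("automaton cannot be modified after creation")
        if word < self._previous:
            raise ValueError("input must be sorted")
        # Length of the prefix shared with the previous word.
        common = 0
        for a, b in zip(word, self._previous):
            if a != b:
                break
            common += 1
        # Everything below the shared prefix can no longer change: minimize it.
        self._minimize(common)
        node = self._unchecked[-1][2] if self._unchecked else self.root
        for ch in word[common:]:
            nxt = State()
            node.edges[ch] = nxt
            self._unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        self._previous = word

    def finish(self):
        self._minimize(0)
        self._frozen = True

    def _minimize(self, down_to):
        while len(self._unchecked) > down_to:
            parent, ch, child = self._unchecked.pop()
            sig = child.signature()
            if sig in self._register:
                parent.edges[ch] = self._register[sig]  # share equivalent state
            else:
                self._register[sig] = child

    def accepts(self, word):
        node = self.root
        for ch in word:
            node = node.edges.get(ch)
            if node is None:
                return False
        return node.final

    def count_states(self):
        seen, stack = set(), [self.root]
        while stack:
            s = stack.pop()
            if id(s) not in seen:
                seen.add(id(s))
                stack.extend(s.edges.values())
        return len(seen)


builder = SortedFSABuilder()
for w in ["tap", "taps", "top", "tops"]:
    builder.add(w)
builder.finish()
```

For these four words the automaton ends up with 5 states, while a plain
trie over the same words needs 8. A trie, on the other hand, accepts words
in any order and can keep growing, which is the memory-for-flexibility
trade mentioned above.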

If freebase.com can be prepared as (a bunch of) FSAs, it would be possible
to write a naive FSA-based NER plugin for ES. Why naive? The promise of NLP
is that it recognizes more features in a text than an FSA can. With POS
tagging and sentence boundary detection, as offered by OpenNLP, UIMA, or
Stanford NLP, it is possible to resolve ambiguities in the meaning of
words. Another problem is text that mixes multiple languages. This problem
is hard, even for the best NLP implementations out there. With my
langdetect plugin, a list of languages can be detected in ES fields, and
this may help further NLP-based processing.
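
A tiny Python sketch of why the pure lexicon approach is naive: when the
same surface form is registered under more than one ID, a lookup can only
return all candidates and has no context to choose between them. The IDs
and the helper names build_lookup() and candidates() are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: two different entities share the variant "Jordan".
VARIANTS = {
    "ex:place/jordan": ["Jordan"],
    "ex:person/jordan": ["Jordan", "Michael Jordan"],
}


def build_lookup(variants):
    """Map each lowercased name variant to the set of IDs that register it."""
    lookup = defaultdict(set)
    for entity_id, names in variants.items():
        for name in names:
            lookup[name.lower()].add(entity_id)
    return lookup


def candidates(surface, lookup):
    """Return all IDs for a surface form; the lexicon alone cannot pick one."""
    return sorted(lookup.get(surface.lower(), ()))


lookup = build_lookup(VARIANTS)
ambiguous = candidates("Jordan", lookup)
```

Here "Jordan" yields both IDs. Choosing between them needs the kind of
context that POS tagging or sentence-level analysis can supply.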

Jörg
