I am trying to use Elasticsearch to do some document understanding. Here are the documents I am indexing:
DELETE index
PUT index/_doc/id1
{
  "id": "id1",
  "name": "Lake Tahoe",
  "indexed_at": "2019-03-01"
}
PUT index/_doc/id2
{
  "id": "id2",
  "name": "Nevada",
  "indexed_at": "2019-03-01"
}
PUT index/_doc/id3
{
  "id": "id3",
  "name": "California",
  "indexed_at": "2019-03-01"
}
PUT index/_doc/id4
{
  "id": "id4",
  "name": "Texas",
  "indexed_at": "2019-03-01"
}
I want to extract all of the related terms (e.g. "Lake Tahoe", "California", "Nevada") from a long text/document such as this one:
"Lake Tahoe is a large freshwater lake in the Sierra Nevada Mountains, straddling the border of California and Nevada. It’s known for its beaches and ski resorts."
Is this feasible? How should I index the data, and which query pattern should I use?
I have the same question. Suppose I want to retrieve the entities I stored in Elasticsearch from both short text such as "Lake Tahoe resort" and long text like the example above. Does that mean I have to change every single entity I stored to use the "annotated_text" field type?
No. Entity extraction tools like OpenNLP, Rosette or OpenCalais mine structured data (people, places, companies) from unstructured text. Ordinarily you store the structured data in structured ‘keyword’ type fields and keep the original text as a ‘text’ type field. However, if you choose to use the new ‘annotated_text’ field type you can upgrade your text to include both free text and the structured keywords discovered by your choice of entity extraction tool. You don’t have to use the annotated text field type but the blog I shared illustrates the benefits of this technique.
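To make the ordinary 'keyword' plus 'text' layout concrete, here is a minimal Python sketch of a mapping body and a document body. The field names "body" and "entities" and the helper build_document are made up for illustration, and the entity list is assumed to come from whichever extraction tool you choose; nothing here calls Elasticsearch itself.

```python
import json

# Hypothetical field names: "body" keeps the original text,
# "entities" holds the structured values an extraction tool found.
mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},
            "entities": {"type": "keyword"},
        }
    }
}


def build_document(text, entities):
    """Pair the raw text with the de-duplicated entity names found in it."""
    unique = list(dict.fromkeys(entities))  # drops repeats, keeps first-seen order
    return {"body": text, "entities": unique}


doc = build_document(
    "Lake Tahoe is a large freshwater lake in the Sierra Nevada Mountains, "
    "straddling the border of California and Nevada.",
    ["Lake Tahoe", "California", "Nevada", "Nevada"],
)

print(json.dumps(mapping, indent=2))
print(json.dumps(doc, indent=2))
```

With this layout you can aggregate or filter on "entities", but the index keeps no record of where in "body" each entity was mentioned.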
I wrote the support for annotated_text out of frustration with plain ‘keyword’ and ‘text’ fields. For example, an aggregation on the keyword field would tell me “John F Kennedy” is mentioned in some docs but I could never see where in a long text he was mentioned. The highlighter would not work on the text field because the text mentioned “JFK”, not “John F Kennedy”. All traceability of what the entity extraction tool discovered (JFK = John F Kennedy) was lost. With the annotated text field we can highlight mentions of extracted entities like this because we weave the structured keyword values into the indexed text.
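As a rough sketch of that weaving step, the Python below turns plain text plus a list of entity mentions into the markdown-like form the annotated text field expects, i.e. [surface text](URL-encoded value). The annotate helper, the sample sentence and the character offsets are all invented for illustration; I am assuming your entity extraction tool can report each mention as a (start, end, entity name) triple and that mentions do not overlap.

```python
from urllib.parse import quote_plus


def annotate(text, mentions):
    """Wrap each (start, end, entity) mention as [surface text](Entity+Name).

    Offsets are character positions into `text`; mentions must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, entity in sorted(mentions):
        pieces.append(text[cursor:start])  # untouched text before the mention
        pieces.append("[{}]({})".format(text[start:end], quote_plus(entity)))
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last mention
    return "".join(pieces)


# Made-up sample input; a real pipeline would take offsets from the extraction tool.
text = "JFK visited Lake Tahoe."
mentions = [(0, 3, "John F Kennedy"), (12, 22, "Lake Tahoe")]

print(annotate(text, mentions))
```

The resulting string is what you would index into the annotated text field, so that a search for the keyword value John F Kennedy can match and highlight the place where the text says JFK. Text that already contains square brackets would need escaping first, which this sketch does not attempt.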