Can Elasticsearch handle long text?

I am trying to use Elasticsearch to do some document understanding. Here are the documents I index:

DELETE index
PUT index/_doc/id1
{
   "id": "id1",
   "name": "Lake Tahoe",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id2
{
   "id": "id2",
   "name": "Nevada",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id3
{
   "id": "id3",
   "name": "California",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id4
{
   "id": "id4",
   "name": "Texas",
   "indexed_at": "2019-03-01"
}

I want to extract all of the indexed terms that are mentioned (e.g. "Lake Tahoe", "California", "Nevada" etc.) in a long text/document such as this one:

"Lake Tahoe is a large freshwater lake in the Sierra Nevada Mountains, straddling the border of California and Nevada. It’s known for its beaches and ski resorts."

Is this feasible? How should I index the data, and which query pattern should I use?

I think you can use the percolator API for something similar to what you described.
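
To make the percolator idea a bit more concrete, here is a minimal sketch in Python. It only builds the request bodies as plain dicts and does not talk to a cluster. The layout is an assumption, not something taken from your index: a hypothetical `query` field of type percolator holds one match_phrase query per entity, and a hypothetical `text` field is what the long document gets percolated against. The mapping uses the typeless form from Elasticsearch 7; on 6.x it would have to be nested under a type name.

```python
import json


def percolator_mapping():
    """Index body with a percolator field plus the text field the queries target.

    Field names "query" and "text" are hypothetical choices for this sketch.
    """
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "text": {"type": "text"},
            }
        }
    }


def entity_doc(name):
    """One stored entity: its name plus a phrase query that finds it in a text."""
    return {"name": name, "query": {"match_phrase": {"text": name}}}


def percolate_request(long_text):
    """Search body asking which stored entity queries match the given text."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"text": long_text},
            }
        }
    }


if __name__ == "__main__":
    text = ("Lake Tahoe is a large freshwater lake in the Sierra Nevada "
            "Mountains, straddling the border of California and Nevada.")
    for body in (percolator_mapping(), entity_doc("Lake Tahoe"), percolate_request(text)):
        print(json.dumps(body, indent=2))
```

You would send the mapping when creating the index, index one `entity_doc` per entity, and send the `percolate_request` body to that index's _search endpoint. The hits are then the entities whose stored query matches the text, and the same request works for short and long inputs.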

A different approach might be https://github.com/spinscale/elasticsearch-ingest-opennlp, but maybe your example was only theoretical and not the real use case...

There are commercial vendors out there too, e.g. try the demo at http://www.opencalais.com/opencalais-demo/

Whatever tool you pick, the new annotated_text field type is specifically designed to index this type of content in elasticsearch.

Thanks for replying, @dadoonet !

I want to handle both short text and long text/documents, so I am not sure whether changing how the data is indexed would help.

Or does it mean that for every entity I want to store, I should add a "percolator" type field to the index?

Thanks @Mark_Harwood!

I have the same question here. Suppose I want to retrieve the entities I stored in Elasticsearch for both short text such as "Lake Tahoe resort" and long text like the example I listed above. Does that mean I have to change every single entity I stored to use the "annotated_text" field type?

No. Entity extraction tools like OpenNLP, Rosette or OpenCalais mine structured data (people, places, companies) from unstructured text. Ordinarily you store the structured data in structured ‘keyword’ type fields and keep the original text as a ‘text’ type field. However, if you choose to use the new ‘annotated_text’ field type you can upgrade your text to include both free text and the structured keywords discovered by your choice of entity extraction tool. You don’t have to use the annotated text field type but the blog I shared illustrates the benefits of this technique.
I wrote the support for annotated_text out of the frustration in dealing with plain ‘keyword’ and ‘text’ fields. For example, an aggregation on the keyword field would tell me “John F Kennedy” is mentioned in some docs but I could never see where in a long text he was mentioned. The highlighter would not work on the text field because the text mentioned “JFK” - not “John F Kennedy”. All traceability of what the entity extraction tool discovered (JFK = John F Kennedy) was lost. With the annotated text field we can highlight mentions of extracted entities like this because we weave the structured keyword values into the indexed text.
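
As a rough illustration of the weaving step, here is a sketch in Python that turns plain text plus a table of mentions into the markdown-like `[mention](value)` syntax that the annotated_text field accepts, with the value URL-encoded. The `annotate` helper and the mentions table are hypothetical. In practice the mentions would come from whichever entity extraction tool you use, and the exact-string matching below is only a stand-in for it.

```python
import re
from urllib.parse import quote_plus


def annotate(text, mentions):
    """Wrap each known mention in annotated_text markup.

    `mentions` maps a surface form found in the text (e.g. "JFK") to the
    structured keyword value it stands for (e.g. "John F Kennedy").
    """
    if not mentions:
        return text
    # Try longer surface forms first so "Sierra Nevada" wins over "Nevada".
    forms = sorted(mentions, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b")

    def wrap(match):
        surface = match.group(0)
        # Annotation values are URL-encoded, so spaces become "+".
        return f"[{surface}]({quote_plus(mentions[surface])})"

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(annotate("JFK gave a speech in Dallas", {"JFK": "John F Kennedy"}))
    # prints: [JFK](John+F+Kennedy) gave a speech in Dallas
```

The annotated string is what you would index into a field mapped with "type": "annotated_text", which as far as I know requires the mapper-annotated-text plugin. A search or highlight on the keyword value "John F Kennedy" can then land on the "JFK" mention in the text.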
