Can elasticsearch handle long text?

flash · March 26, 2019, 6:25pm

I am trying to use Elasticsearch to do some document understanding. Here are the documents I index:

DELETE index
PUT index/_doc/id1
{
   "id": "id1",
   "name": "Lake Tahoe",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id2
{
   "id": "id2",
   "name": "Nevada",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id3
{
   "id": "id3",
   "name": "California",
   "indexed_at": "2019-03-01"
}
PUT index/_doc/id4
{
   "id": "id4",
   "name": "Texas",
   "indexed_at": "2019-03-01"
}

I want to extract all of the related terms (e.g. "Lake Tahoe", "California", "Nevada" etc.) from such a long text/document:

"Lake Tahoe is a large freshwater lake in the Sierra Nevada Mountains, straddling the border of California and Nevada. It’s known for its beaches and ski resorts."

Is it feasible? How should I index the data, and which query pattern should I use?

dadoonet · March 26, 2019, 11:05pm

I think you can use the percolator API for something similar to what you described.
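As a rough illustration of the percolator idea, here is a minimal Python sketch that only builds the request bodies (it does not talk to a cluster). The index layout and the field names `query`, `text` and `name` are assumptions made for this example, not something from the original question, and the mapping is written in the typeless style of Elasticsearch 7.x and later. Each entity is stored as a `match_phrase` query, and the long text is then sent as the document of a `percolate` query so that the matching entities come back as hits.

```python
import json

# Entity names taken from the question; everything else below is hypothetical.
ENTITY_NAMES = ["Lake Tahoe", "Nevada", "California", "Texas"]


def percolator_mapping():
    """Body for creating the index: a percolator field plus the field
    that the stored queries refer to (it must be mapped in the same index)."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "text": {"type": "text"},
                "name": {"type": "keyword"},
            }
        }
    }


def entity_document(name):
    """One stored query per entity: match the entity name as a phrase."""
    return {"name": name, "query": {"match_phrase": {"text": name}}}


def percolate_search(text):
    """Search body that asks which stored queries match the given text."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"text": text},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(percolator_mapping(), indent=2))
    for entity in ENTITY_NAMES:
        print(json.dumps(entity_document(entity)))
    print(json.dumps(percolate_search("Lake Tahoe straddles the border of California and Nevada."), indent=2))
```

The same stored queries work for short and long inputs, because the text being analysed is passed at search time rather than indexed.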

A different way may be https://github.com/spinscale/elasticsearch-ingest-opennlp but maybe your example was only theoretical and not the real use case...

Mark_Harwood · March 27, 2019, 9:25am

There are commercial vendors out there too e.g. try the demo at Products | LSEG

Whatever tool you pick, the new annotated_text field type is specifically designed to index this type of content in elasticsearch.

flash · March 28, 2019, 12:54am

Thanks for replying, @dadoonet !

I want to handle both short text and long text/documents, so I am not sure whether changing how I index would help.

Or does it mean that for every entity I want to store, I should add a "percolator" type field to the index?

flash · March 28, 2019, 12:57am

Thanks @Mark_Harwood!

I have the same question for you. I want to retrieve the entities I stored in Elasticsearch both for short text such as "Lake Tahoe resort" and for long text like the example I listed above. Does that mean I have to change every single entity I stored to use the "annotated_text" field type?

Mark_Harwood · March 28, 2019, 7:52am

No. Entity extraction tools like OpenNLP, Rosette or OpenCalais mine structured data (people, places, companies) from unstructured text. Ordinarily you store the structured data in structured ‘keyword’ type fields and keep the original text as a ‘text’ type field. However, if you choose to use the new ‘annotated_text’ field type you can upgrade your text to include both free text and the structured keywords discovered by your choice of entity extraction tool. You don’t have to use the annotated text field type but the blog I shared illustrates the benefits of this technique.
I wrote the support for annotated_text out of the frustration in dealing with plain ‘keyword’ and ‘text’ fields. For example, an aggregation on the keyword field would tell me “John F Kennedy” is mentioned in some docs but I could never see where in a long text he was mentioned. The highlighter would not work on the text field because the text mentioned “JFK” - not “John F Kennedy”. All traceability of what the entity extraction tool discovered (JFK = John F Kennedy) was lost. With the annotated text field we can highlight mentions of extracted entities like this because we weave the structured keyword values into the indexed text.
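To make the "weaving" step concrete, here is a small Python sketch that turns plain text plus a set of extracted entities into the markdown-like `[mention](value)` form that the annotated_text field (shipped in the mapper-annotated-text plugin) accepts, with the value URL-encoded. The `annotate` helper and its input format are hypothetical; in practice the mapping from mentions to entities would come from your entity extraction tool, and text that already contains square brackets would need extra handling that this sketch skips.

```python
import re
from urllib.parse import quote_plus


def annotate(text, entities):
    """Wrap each known mention in [mention](Encoded+Entity+Value).

    `entities` maps a surface form found in the text (e.g. "JFK") to the
    structured keyword value it stands for (e.g. "John F Kennedy").
    """
    if not entities:
        return text
    # Try longer mentions first so "Lake Tahoe" wins over a shorter overlap.
    surfaces = sorted(entities, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(s) for s in surfaces) + r")\b")

    def replace(match):
        mention = match.group(1)
        return "[{}]({})".format(mention, quote_plus(entities[mention]))

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(annotate("JFK spoke in Berlin.", {"JFK": "John F Kennedy"}))
```

The resulting string is what you would index into the annotated_text field, so that a search or highlight on the entity value can point back at the "JFK" mention in the original text.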

system · April 25, 2019, 7:52am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
