Highlighting on Fields that aren't Stored or in _Source

John_Bartak · April 17, 2012, 9:52pm

Here's my situation: I'm planning an Elasticsearch solution, running on
EC2 instance-store instances, that will index millions (possibly hundreds
of millions) of documents. The amount of storage you get per
instance is limited (1.6 TB for X-Large instances) so I was planning on
storing as little data as possible in the index. But I want to be able to
highlight fields that aren't stored in the index. I'm planning on using
Tika (and a few other tools) to extract content and I'll store the
extracted content in S3 (so I don't have to re-extract the content every
time I re-index a document).

So is there some way I can highlight content that is in S3, probably as a
post-processing step of the search?

Looking at the source code, it seems like it should be possible to do
something similar to what was done with the SourceSimpleFragmentsBuilder.
Basically, inherit from SimpleFragmentsBuilder and override createFragments
so that, if there are no values, it returns a JSON document that contains the
startOffset, endOffset, and termOffsets for each highlight. Then have my
code use those offsets with the content stored in S3 to build highlights.
(Of course, this only works for fields that store term vectors.)
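
To make the post-processing step concrete, here is a minimal sketch of the client side, assuming the overridden fragments builder returns a list of objects with startOffset, endOffset, and termOffsets (a list of [start, end] pairs). That JSON shape and the build_highlights name are hypothetical, not something Elasticsearch provides. It also assumes the offsets are character positions into exactly the same extracted text that was indexed; text containing characters outside the Basic Multilingual Plane may need an offset conversion first, and the output is not HTML-escaped.

```python
import json


def build_highlights(text, fragments, pre_tag="<em>", post_tag="</em>"):
    """Rebuild highlighted snippets from offsets plus externally stored text.

    `fragments` uses a hypothetical shape: each item has "startOffset",
    "endOffset", and "termOffsets" (a list of [start, end] pairs), all
    expressed as positions into `text`.
    """
    highlights = []
    for frag in fragments:
        start, end = frag["startOffset"], frag["endOffset"]
        pieces = []
        cursor = start
        for term_start, term_end in sorted(frag["termOffsets"]):
            # Clamp each term to the fragment and skip overlaps or empty spans.
            term_start = max(term_start, cursor)
            term_end = min(term_end, end)
            if term_start >= term_end:
                continue
            pieces.append(text[cursor:term_start])
            pieces.append(pre_tag + text[term_start:term_end] + post_tag)
            cursor = term_end
        # Append whatever follows the last matched term in this fragment.
        pieces.append(text[cursor:end])
        highlights.append("".join(pieces))
    return highlights


# Stand-ins for the content fetched from S3 and the offsets from the search hit.
extracted_text = "The quick brown fox jumps over the lazy dog"
offsets_json = '[{"startOffset": 4, "endOffset": 19, "termOffsets": [[4, 9], [16, 19]]}]'
snippets = build_highlights(extracted_text, json.loads(offsets_json))
```

The important constraint is that the text in S3 must be byte-for-byte the text that was analyzed at index time, otherwise the offsets will point at the wrong characters.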

Does this seem like a reasonable idea, or would I just be better off storing
compressed data in the _source field?

kimchy · April 19, 2012, 1:39pm

There is an issue open to allow getting the _source of a document from an
external storage (which can be implemented as a plugin). It's not there yet,
though...

