Highlighting on Fields that aren't Stored or in _source

Here's my situation: I'm planning an Elasticsearch solution that will run
on EC2 instance-store instances and index millions (possibly hundreds of
millions) of documents. The amount of storage you get per instance is
limited (1.6 TB for X-Large instances), so I was planning on storing as
little data as possible in the index. But I want to be able to highlight
fields that aren't stored in the index. I'm planning on using Tika (and a
few other tools) to extract content, and I'll store the extracted content
in S3 (so I don't have to re-extract the content every time I re-index a
document).

So is there some way I can highlight content that is in S3, probably as a
post-processing step of the search?

Looking at the source code, it seems like it should be possible to do
something similar to what was done with the SourceSimpleFragmentsBuilder:
basically, inherit from SimpleFragmentsBuilder and override createFragments
so that, if there are no stored values, it returns a JSON document
containing the startOffset, endOffset, and termOffsets for each highlight.
My code would then use those offsets with the content stored in S3 to build
the highlights. (Of course, this only works for fields that store term
vectors.)
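
To make the post-processing half of that idea concrete, here is a rough
sketch of how client code might turn returned offsets into a highlighted
snippet. It does not touch the Elasticsearch or Lucene APIs at all:
FragmentOffsets and OffsetHighlighter are hypothetical names made up for
this illustration, and the sketch assumes the offsets are character
offsets into exactly the same text that was analyzed at index time (if the
S3 copy differs at all, the highlights will drift).

```java
import java.util.List;

/** Hypothetical holder for the offsets a custom fragments builder might return. */
final class FragmentOffsets {
    final int startOffset;          // start of the fragment in the full text
    final int endOffset;            // end of the fragment (exclusive)
    final List<int[]> termOffsets;  // each entry is {start, end}, absolute offsets

    FragmentOffsets(int startOffset, int endOffset, List<int[]> termOffsets) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.termOffsets = termOffsets;
    }
}

/** Hypothetical client-side helper; not part of Elasticsearch or Lucene. */
final class OffsetHighlighter {
    /**
     * Builds one highlighted fragment from externally stored content.
     * Assumes termOffsets are sorted by start offset and do not overlap.
     */
    static String highlight(String content, FragmentOffsets f,
                            String preTag, String postTag) {
        // Clamp the fragment bounds in case the external copy is shorter.
        int start = Math.max(0, Math.min(f.startOffset, content.length()));
        int end = Math.max(start, Math.min(f.endOffset, content.length()));

        StringBuilder sb = new StringBuilder();
        int cursor = start;
        for (int[] term : f.termOffsets) {
            int termStart = Math.max(term[0], cursor);
            int termEnd = Math.min(term[1], end);
            if (termStart >= termEnd) {
                continue; // term falls outside the fragment; skip it
            }
            sb.append(content, cursor, termStart);           // text before the term
            sb.append(preTag).append(content, termStart, termEnd).append(postTag);
            cursor = termEnd;
        }
        sb.append(content, cursor, end);                     // text after the last term
        return sb.toString();
    }
}
```

For example, with the content "The quick brown fox jumps over the lazy
dog", a fragment spanning offsets 4 to 19, and term offsets {10, 15} and
{16, 19}, highlight(content, f, "<em>", "</em>") gives
"quick <em>brown</em> <em>fox</em>".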

Does this seem like a reasonable idea, or would I just be better off storing
compressed data in the _source field?

There is an issue open to allow getting the _source of a document from
external storage (which can be implemented as a plugin). It's not there yet,
though...

On Wed, Apr 18, 2012 at 12:52 AM, JBartak john.bartak@autodesk.com wrote:
