Time-based Phrase Search Without Doubling Transcript field storage – Any Better Way?

Hi,

We’re building a solution that enables time-based phrase search over audio transcripts in Elasticsearch. The goal is to find exact phrases that occur at a certain point in time, e.g., "may i help you" within the first 5 seconds of a conversation, or "goodbye" within the last 5 seconds.

We preprocess our transcript field to include word-level time offsets, like so:
Original "transcript": "hello this is example"
With offsets in the format {word}|{startOffsetInMs}, "transcriptWithStartOffset": "_sot_|0 hello|600 this|1000 is|1300 example|1600"
"_sot_" stands for "start of transcript" and serves only as a zero-offset anchor token.
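For context, here is a minimal sketch of that preprocessing step, assuming the ASR output gives us each word with its start offset in milliseconds (encode_with_offsets is an illustrative name, not our actual code):

```python
def encode_with_offsets(words):
    """Encode (word, startOffsetInMs) pairs into the
    transcriptWithStartOffset format, prefixed with the
    zero-offset "_sot_" anchor token."""
    tokens = ["_sot_|0"]
    tokens.extend(f"{word}|{offset}" for word, offset in words)
    return " ".join(tokens)

# The "hello this is example" transcript from above:
words = [("hello", 600), ("this", 1000), ("is", 1300), ("example", 1600)]
print(encode_with_offsets(words))
# → _sot_|0 hello|600 this|1000 is|1300 example|1600
```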
We created a custom analyzer plugin that:

  • Parses the text above
  • Sets token positions based on the {startOffsetInMs} value
    E.g. "_sot_" - position 0, "hello" - position 600, "this" - position 1000, etc.
  • Allows time-based phrase search using an intervals query with a script filter, like this:
...
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "hello this is",
      "filter": {
        "script": {
          "source": "interval.end < 1100"
        }
      }
    }
  }
}
...
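For the "last 5 seconds" case mentioned above, the same pattern applies in the other direction. A sketch, assuming the client knows the conversation duration (here 60,000 ms) and inlines the cutoff of 55,000 ms into the script:

```json
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "goodbye",
      "filter": {
        "script": {
          "source": "interval.start >= 55000"
        }
      }
    }
  }
}
```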

The Problem

To enable this, we now store two fields per document:

  • transcript: plain text - required for other types of searches
  • transcriptWithStartOffset: custom-formatted version for time-based search

This doubles the storage for large transcripts — and that's becoming costly at scale!

We’re looking for guidance or ideas to reduce duplication while retaining time-based search features.

Specifically:

  1. Is there a better way to store offsets without duplicating the transcript text?
  2. Could offsets live in a separate field (like wordsWithOffsets: [{word: "hello", offset: 600}]) and still be usable in queries with custom analyzers?
  3. Is there any supported way for a custom analyzer to read from:
  • other fields in the document? (I've found the analyzer has no visibility into other fields of the document)
  • an external file or API for offset lookup?
  4. Could we reduce the size using index-level compression settings?
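One data point for question 1: the plain transcript is mechanically recoverable from the offset-encoded form, so in principle only one copy needs to be stored and the other can be derived. A minimal sketch of the inverse transform (strip_offsets is an illustrative name):

```python
def strip_offsets(encoded):
    """Recover the plain transcript from the
    "{word}|{startOffsetInMs}" encoding, dropping the "_sot_" anchor."""
    words = [token.split("|", 1)[0] for token in encoded.split()]
    return " ".join(w for w in words if w != "_sot_")

print(strip_offsets("_sot_|0 hello|600 this|1000 is|1300 example|1600"))
# → hello this is example
```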

We understand analyzers are designed to be fast and isolated, so we’re open to alternative approaches — maybe using ingest pipelines, runtime fields, or transforms?

Thanks in advance for any ideas or best practices!