Hi,
We’re building a solution that allows time-based phrase search over audio transcripts in Elasticsearch. The goal is to find exact phrases that occur at a certain point in time, e.g. the phrase "may i help you" within the first 5 seconds of a conversation, or the phrase "goodbye" within the last 5 seconds of a conversation.
We preprocess our transcripts field to include word-level time offsets, like so:
Original "transcript": "hello this is example"
With offsets format ({word}|{startOffsetInMs}
) "transcriptWithStartOffset": "_sot_|0 hello|600 this|1000 is|1300 example|1600"
"_sot_" stands for "start of transcript" and is used just as a zero-based token.
We created a custom analyzer plugin that:
- Parses the text above
- Sets token positions based on the {startOffsetInMs}, e.g. "_sot_" - position 0, "hello" - position 600, "this" - position 1000, etc.
- Allows time-based phrase search using an intervals query with a script filter, like this:
...
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "hello this is",
      "filter": {
        "script": {
          "source": "interval.end < 1100"
        }
      }
    }
  }
}
...
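The "last N seconds" case works the same way, except the lower bound is computed client-side from the conversation's total duration before building the query. For illustration, finding "goodbye" in the last 5 seconds of a 300-second call would look roughly like this (the 295000 comes from our application, not from Elasticsearch):

...
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "goodbye",
      "filter": {
        "script": {
          "source": "interval.start >= 295000"
        }
      }
    }
  }
}
...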
The Problem
To enable this, we now store two fields per document:
- "transcript": plain text, required for other types of search
- "transcriptWithStartOffset": the custom-formatted version for time-based search

This doubles the storage for large transcripts, and that's becoming costly at scale!
We’re looking for guidance or ideas to reduce duplication while retaining time-based search features.
Specifically:
- Is there a better way to store offsets without duplicating the transcript text?
- Could offsets live in a separate field (like wordsWithOffsets: [{word: "hello", offset: 600}], sketched below) and still be usable in queries with custom analyzers?
- Is there any supported way for a custom analyzer to read from:
  - Other fields in the document? (I've found the analyzer has no visibility into other fields of the document.)
  - An external file or API for offset lookup?
- Could index-level compression settings reduce the size?
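To make the separate-field idea concrete, this is the kind of structure we have in mind (just a sketch, nothing is implemented this way yet; field names are ours):

PUT transcripts
{
  "mappings": {
    "properties": {
      "transcript": {
        "type": "text"
      },
      "wordsWithOffsets": {
        "type": "nested",
        "properties": {
          "word": { "type": "keyword" },
          "offset": { "type": "long" }
        }
      }
    }
  }
}

The open question is whether an analyzer or the intervals query could consume something like this instead of the duplicated, pipe-delimited text.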
We understand analyzers are designed to be fast and isolated, so we’re open to alternative approaches — maybe using ingest pipelines, runtime fields, or transforms?
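For example, an ingest pipeline could build the offset-annotated field from a structured words array at index time, so the transcript text only has to be shipped once from the client. A rough sketch, assuming the incoming document carries a words array of {word, offset} objects (this still stores the text twice in the index, so it only removes the client-side duplication):

PUT _ingest/pipeline/build-transcript-offsets
{
  "processors": [
    {
      "script": {
        "description": "Concatenate words and start offsets into transcriptWithStartOffset",
        "source": "String out = '_sot_|0'; for (def w : ctx.words) { out += ' ' + w.word + '|' + w.offset; } ctx.transcriptWithStartOffset = out;"
      }
    }
  ]
}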
Thanks in advance for any ideas or best practices!