Hi,
We’re building a solution that allows time-based phrase search over audio transcripts in Elasticsearch. The goal is to find exact phrases that occur at a certain point in time, e.g. the phrase "may i help you" within the first 5 seconds of a conversation, or the phrase "goodbye" within the last 5 seconds of a conversation.
We preprocess our transcripts field to include word-level time offsets, like so:
Original "transcript": "hello this is example"
With offsets format ({word}|{startOffsetInMs}
) "transcriptWithStartOffset": "_sot_|0 hello|600 this|1000 is|1300 example|1600"
"_sot_" stands for "start of transcript" and is used just as a zero-based token.
We created a custom analyzer plugin that:
- Parses the text above
- Sets token positions based on the {startOffsetInMs}, e.g. "_sot_" - position 0, "hello" - position 600, "this" - position 1000, etc.
- Allows time-based phrase search using an intervals query with a script filter, like this:
...
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "hello this is",
      "filter": {
        "script": {
          "source": "interval.end < 1100"
        }
      }
    }
  }
}
...
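The "last N seconds" case works the same way, except the lower bound is computed client-side from the conversation's total duration before building the query. For illustration, finding "goodbye" in the last 5 seconds of a 300-second call would look roughly like this (the 295000 comes from our application, not from Elasticsearch):

...
"intervals": {
  "transcriptWithStartOffset": {
    "match": {
      "query": "goodbye",
      "filter": {
        "script": {
          "source": "interval.start >= 295000"
        }
      }
    }
  }
}
...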
The Problem
To enable this, we now store two fields per document:
- "transcript": plain text, required for other types of search
- "transcriptWithStartOffset": the custom-formatted version for time-based search

This doubles the storage for large transcripts, and that's becoming costly at scale!
We’re looking for guidance or ideas to reduce duplication while retaining time-based search features.
Specifically:
- Is there a better way to store offsets without duplicating the transcript text?
- Could offsets live in a separate field (like wordsWithOffsets: [{word: "hello", offset: 600}], sketched below) and still be usable in queries with custom analyzers?
- Is there any supported way for a custom analyzer to read from:
  - Other fields in the document? (I've found the analyzer has no visibility into other fields of the document.)
  - An external file or API for offset lookup?
- Could index-level compression settings reduce the size?
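To make the separate-field idea concrete, this is the kind of structure we have in mind (just a sketch, nothing is implemented this way yet; field names are ours):

PUT transcripts
{
  "mappings": {
    "properties": {
      "transcript": {
        "type": "text"
      },
      "wordsWithOffsets": {
        "type": "nested",
        "properties": {
          "word": { "type": "keyword" },
          "offset": { "type": "long" }
        }
      }
    }
  }
}

The open question is whether an analyzer or the intervals query could consume something like this instead of the duplicated, pipe-delimited text.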
We understand analyzers are designed to be fast and isolated, so we’re open to alternative approaches — maybe using ingest pipelines, runtime fields, or transforms?
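For example, an ingest pipeline could build the offset-annotated field from a structured words array at index time, so the transcript text only has to be shipped once from the client. A rough sketch, assuming the incoming document carries a words array of {word, offset} objects (this still stores the text twice in the index, so it only removes the client-side duplication):

PUT _ingest/pipeline/build-transcript-offsets
{
  "processors": [
    {
      "script": {
        "description": "Concatenate words and start offsets into transcriptWithStartOffset",
        "source": "String out = '_sot_|0'; for (def w : ctx.words) { out += ' ' + w.word + '|' + w.offset; } ctx.transcriptWithStartOffset = out;"
      }
    }
  ]
}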
Thanks in advance for any ideas or best practices!