I want to index subtitles while preserving the timestamp data.
My initial thought was to make each line of text a document, with the start and end timestamps stored as fields on that document. For example, the following two lines would be stored as two documents (with fields such as start, end, text, title, date):
01:33:22.285 --> 01:33:24.365
Lorem Ipsum is simply dummy
01:33:24.365 --> 01:33:27.485
text of the printing typesetting
However, if I search for the phrase "simply dummy text", no single document contains that exact phrase, so no results are returned. To make such searches useful I would have to index the entire file's text as one document, but then I would lose the timestamp data.
Does anyone know how I can keep the timestamp data? Some subtitle files contain three hours' worth of text, and any question answered by the search results will immediately be followed by another: when exactly did this speech occur?
Just your usual WebVTT file. For example, I have one file for a two-hour show that is 229 KB and contains 3,730 captions, each with a start time, an end time, and one line of text no longer than 33 characters.
Yes, that is correct, but as I mentioned in my original post, searching for the exact phrase "simply dummy text" returns no results, because that phrase spans two captions and therefore two documents.
Maybe you need to do something "smarter", like prepending the preceding text to your subtitles. Ideally full sentences, but that would mean you have to detect what a sentence is...
So maybe index something like:
POST /captions/_doc
{
  "episode_id": 1,
  "episode_title": "Foo Bar Baz",
  "episode_start": "2023-09-25",
  "subtitle_start": 5604365,
  "subtitle_end": 5607485,
  "subtitle_text": "text of the printing typesetting",
  "subtitle_text_with_previous": "Lorem Ipsum is simply dummy text of the printing typesetting"
}
I don't see a way to do that automatically in Elasticsearch.
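To illustrate the application-side merging (nothing here is a built-in Elasticsearch feature), here is a minimal Python sketch that parses cue blocks from WebVTT text, converts the timestamps to milliseconds, and prepends the previous cue's text so phrase queries can match across cue boundaries. The function names are made up for the example; only the JSON field names above come from the thread:

```python
import re

def vtt_ts_to_ms(ts: str) -> int:
    """Convert a WebVTT timestamp like "01:33:24.365" to integer
    milliseconds, matching the subtitle_start/subtitle_end values above."""
    h, m, s = ts.split(":")
    sec, ms = s.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(sec)) * 1000 + int(ms)

def build_docs(vtt_text: str, overlap: int = 1) -> list[dict]:
    """Turn cue blocks into index-ready documents, prepending the text
    of the previous `overlap` cues to each one."""
    cue_re = re.compile(
        r"(\d\d:\d\d:\d\d\.\d{3}) --> (\d\d:\d\d:\d\d\.\d{3})\n(.+)")
    cues = [(vtt_ts_to_ms(a), vtt_ts_to_ms(b), text)
            for a, b, text in cue_re.findall(vtt_text)]
    docs = []
    for i, (start, end, text) in enumerate(cues):
        prev = " ".join(t for _, _, t in cues[max(0, i - overlap):i])
        docs.append({
            "subtitle_start": start,
            "subtitle_end": end,
            "subtitle_text": text,
            "subtitle_text_with_previous": (prev + " " + text).strip(),
        })
    return docs
```

Run over the two example cues, the second document comes out with `subtitle_start` 5604365 and `subtitle_text_with_previous` "Lorem Ipsum is simply dummy text of the printing typesetting", as in the request body above. Raising `overlap` trades index size for longer matchable phrases.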
As far as I know, the short answer is that this requires application-side document processing (unless you make one document per line, which likely adds too much overhead).
I can tell you that this is what we do in several applications:
- index the text, for matching purposes, with all subtitle time codes (and e.g. soft line breaks) removed
- for each hit to be displayed/used, dereference by scanning the original text to find the highlighted snippet
This is obviously quite expensive and only viable for small result sets; it also requires storing the original, time-code-decorated document.
I have done this with this exact ASR format, for the record.
There are some corner cases to be aware of, e.g. multiple occurrences of the same text in the document. In the worst cases you may not have sufficient context to disambiguate.
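A rough sketch of that dereferencing step, assuming you keep the parsed cues around as `(start_ms, end_ms, text)` tuples: build the concatenated match text once, record each cue's character offset within it, then map a hit's character span back to a timecode pair with a binary search. All function names here are hypothetical:

```python
import bisect

def concat_with_offsets(cues):
    """Return the concatenated match text plus a per-cue character
    offset table, so a snippet's position can be mapped back to cues."""
    parts, offsets = [], []   # offsets[i] = char offset where cue i begins
    pos = 0
    for _start, _end, text in cues:
        offsets.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(parts), offsets

def timecodes_for_span(cues, offsets, lo: int, hi: int):
    """Return (start_ms, end_ms) covering the snippet at chars [lo, hi)."""
    first = bisect.bisect_right(offsets, lo) - 1
    last = bisect.bisect_right(offsets, hi - 1) - 1
    return cues[first][0], cues[last][1]
```

For the "simply dummy text" example this returns the start time of the first cue and the end time of the second, i.e. the span the phrase actually occupies on screen. The multiple-occurrence corner case remains: `str.find` style scanning only locates the first occurrence unless you also carry the hit's offset from the search engine.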
An idle idea: if performance is critical, you could store in your search documents an inverted index of some kind. I can imagine tree data structures that would be quite expensive to generate but very fast to consult to get back the timecode offset at word resolution.
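One cheap variant of that idea, sketched under the assumption that you can get term positions back from the search engine (Elasticsearch exposes them e.g. via the term vectors API): store alongside each document a flat word-position to timecode table, so a matched term resolves to a timecode with a single array lookup rather than a rescan of the original file.

```python
def word_timecodes(cues):
    """cues: list of (start_ms, end_ms, text) tuples.
    Returns a list where entry i is the start time of the cue
    containing word i of the concatenated text."""
    table = []
    for start, _end, text in cues:
        table.extend([start] * len(text.split()))
    return table
```

The table costs one integer per word (a few kilobytes for the 3,730-caption file mentioned above), which is far smaller than a tree but gives the same word-resolution lookup.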