I want to index subtitles while preserving the timestamp data.
My initial thought was to make each line of text a document, with the start and end timestamps stored as fields on that document. For example, the following two lines would be stored as two documents (with fields such as start, end, text, title, date):
01:33:22.285 --> 01:33:24.365
Lorem Ipsum is simply dummy
01:33:24.365 --> 01:33:27.485
text of the printing typesetting
However, if I search for the phrase "simply dummy text", no single document contains that exact phrase, so no results are returned. To make such searches useful I would have to index the entire file's text as one document, but then I would lose the timestamp data.
Does anyone know how I can keep the timestamp data? Some subtitle files contain three hours' worth of text, and any question answered by the search results will immediately be followed by another: when exactly did this speech occur?
Just your usual WebVTT file. For example, I have one file for a two-hour show that is 229 KB and contains 3,730 captions, each with a start time, an end time, and one line of text no longer than 33 characters.
Yes, that is correct, but as I mentioned in my original post, searching for the exact phrase "simply dummy text" returns no results, because that phrase spans two captions and therefore two documents.
Maybe you need to do something "smarter", like prepending the preceding text to your subtitles. Ideally full sentences, but that would mean you have to detect what a sentence is...
So maybe index something like:
POST /captions/_doc
{
  "episode_id": 1,
  "episode_title": "Foo Bar Baz",
  "episode_start": "2023-09-25",
  "subtitle_start": 5604365,
  "subtitle_end": 5607485,
  "subtitle_text": "text of the printing typesetting",
  "subtitle_text_with_previous": "Lorem Ipsum is simply dummy text of the printing typesetting"
}
I don't see a way to do that automatically in Elasticsearch.
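To illustrate the application-side merging (nothing here is a built-in Elasticsearch feature), here is a minimal Python sketch that parses cue blocks from WebVTT text, converts the timestamps to milliseconds, and prepends the previous cue's text so phrase queries can match across cue boundaries. The function names are made up for the example; only the JSON field names above come from the thread:

```python
import re

def vtt_ts_to_ms(ts: str) -> int:
    """Convert a WebVTT timestamp like "01:33:24.365" to integer
    milliseconds, matching the subtitle_start/subtitle_end values above."""
    h, m, s = ts.split(":")
    sec, ms = s.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(sec)) * 1000 + int(ms)

def build_docs(vtt_text: str, overlap: int = 1) -> list[dict]:
    """Turn cue blocks into index-ready documents, prepending the text
    of the previous `overlap` cues to each one."""
    cue_re = re.compile(
        r"(\d\d:\d\d:\d\d\.\d{3}) --> (\d\d:\d\d:\d\d\.\d{3})\n(.+)")
    cues = [(vtt_ts_to_ms(a), vtt_ts_to_ms(b), text)
            for a, b, text in cue_re.findall(vtt_text)]
    docs = []
    for i, (start, end, text) in enumerate(cues):
        prev = " ".join(t for _, _, t in cues[max(0, i - overlap):i])
        docs.append({
            "subtitle_start": start,
            "subtitle_end": end,
            "subtitle_text": text,
            "subtitle_text_with_previous": (prev + " " + text).strip(),
        })
    return docs
```

Run over the two example cues, the second document comes out with `subtitle_start` 5604365 and `subtitle_text_with_previous` "Lorem Ipsum is simply dummy text of the printing typesetting", as in the request body above. Raising `overlap` trades index size for longer matchable phrases.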
As far as I know, the short answer is that this requires application-side document processing (unless you make one document per line, which likely adds too much overhead).
I can tell you that this is what we do in several applications:
- index the text, for matching purposes, with all subtitle time codes (and e.g. soft line breaks) removed
- for each hit to be displayed/used, dereference by scanning the original text to find the highlighted snippet
This is obviously quite expensive and only viable for small result sets; it also requires storing the original, time-code-decorated document.
I have done this with this exact ASR format, for the record.
There are some corner cases to be aware of, e.g. multiple occurrences of the same text in the document. In the worst cases you may not have sufficient context to disambiguate.
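A rough sketch of that dereferencing step, assuming you keep the parsed cues around as `(start_ms, end_ms, text)` tuples: build the concatenated match text once, record each cue's character offset within it, then map a hit's character span back to a timecode pair with a binary search. All function names here are hypothetical:

```python
import bisect

def concat_with_offsets(cues):
    """Return the concatenated match text plus a per-cue character
    offset table, so a snippet's position can be mapped back to cues."""
    parts, offsets = [], []   # offsets[i] = char offset where cue i begins
    pos = 0
    for _start, _end, text in cues:
        offsets.append(pos)
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(parts), offsets

def timecodes_for_span(cues, offsets, lo: int, hi: int):
    """Return (start_ms, end_ms) covering the snippet at chars [lo, hi)."""
    first = bisect.bisect_right(offsets, lo) - 1
    last = bisect.bisect_right(offsets, hi - 1) - 1
    return cues[first][0], cues[last][1]
```

For the "simply dummy text" example this returns the start time of the first cue and the end time of the second, i.e. the span the phrase actually occupies on screen. The multiple-occurrence corner case remains: `str.find` style scanning only locates the first occurrence unless you also carry the hit's offset from the search engine.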
An idle idea: if performance is critical, you could store in your search documents an inverted index of some kind. I can imagine tree data structures that would be quite expensive to generate but very fast to consult to get back the timecode offset at word resolution.
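One cheap variant of that idea, sketched under the assumption that you can get term positions back from the search engine (Elasticsearch exposes them e.g. via the term vectors API): store alongside each document a flat word-position to timecode table, so a matched term resolves to a timecode with a single array lookup rather than a rescan of the original file.

```python
def word_timecodes(cues):
    """cues: list of (start_ms, end_ms, text) tuples.
    Returns a list where entry i is the start time of the cue
    containing word i of the concatenated text."""
    table = []
    for start, _end, text in cues:
        table.extend([start] * len(text.split()))
    return table
```

The table costs one integer per word (a few kilobytes for the 3,730-caption file mentioned above), which is far smaller than a tree but gives the same word-resolution lookup.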