Advice on indexing long form text

I'm trying to find the most efficient way to index transcripts of long conversations, for example a 2-3 hour podcast with multiple participants, or a large Zoom meeting with many speakers.

On average, about 5-10 different speakers talk for about 2 hours, and I have many of these podcasts, so there is a lot of text. I'm wondering how best to index it into Elasticsearch so that I can run simple full-text searches on the contents, either for a specific phrase within a specific podcast transcript or as a general search across all transcripts from all podcasts.

The format of the text is something along the lines of:

<time> <participant name>: <participant transcript>

I can break this into whatever other format is needed via code.

Any suggestions on the best approach for this use case?
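
For illustration, splitting lines of that shape could look roughly like the sketch below. It assumes the time is written as HH:MM:SS and that the speaker name contains no colon; neither is confirmed by the format above, and the names parse_line and parse_transcript are made up for this example.

```python
import re

# Assumed line shape: "HH:MM:SS Speaker Name: what they said".
# The HH:MM:SS time format is an assumption, adjust the pattern to the real data.
LINE_RE = re.compile(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2})\s+(?P<participant>[^:]+):\s*(?P<content>.*)$"
)


def parse_line(line):
    """Return a dict with time, participant and content, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": match.group("time"),
        "participant": match.group("participant").strip(),
        "content": match.group("content").strip(),
    }


def parse_transcript(text):
    """Parse every matching line of a transcript, skipping lines that do not match."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [entry for entry in parsed if entry is not None]
```

Each returned dict is one speaker turn, which can then be reshaped into whatever document structure ends up being indexed.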

Thanks!

To make your queries easier and give you more filtering options, I would add the podcast name and the episode title as separate fields, and combine the two to generate a podcast id field that lets you filter all the transcript documents from a particular episode.

You could also have fields for the hosts and participants, so you can filter all the podcasts from a specific host or participant, and maybe a field to store the topic of the podcast.

You could go further and add some fields for metrics or classification, like the length of the transcript, or maybe the duration, if you can get this information from your source.

So you would end up with a document like this:

{
    "podcast": {
        "name": "podcast name",
        "title": "podcast title",
        "hosts": ["list", "of", "hosts"],
        "participants": ["list", "of", "participants"],
        "id": "id derived from name + title",
        "category": ["list", "of", "category"]
    },
    "transcript": {
        "content": "the full text content of the transcript",
        "participant": "the name of the participant",
        "time": "the time of the transcript",
        "length": "the length of the transcript",
        "duration": "the duration of the transcript"
    },
    "timestamp": "timestamp of the document, could be a copy of the transcript.time"
}
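
As a rough sketch, a document like this could be assembled in code as shown below. It assumes one document per speaker turn, with each turn already split into time, participant and content. The function names and the id scheme (lowercased name plus title, joined with hyphens) are just one hypothetical choice, and the duration field is left out because it depends on what your source provides.

```python
import re


def make_podcast_id(name, title):
    """Join name and title into a lowercase, hyphen-separated id (hypothetical scheme)."""
    raw = f"{name} {title}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def build_document(podcast, turn):
    """Build one document for a single speaker turn.

    podcast: dict with "name" and "title", optionally "hosts", "participants", "category".
    turn: dict with "time", "participant" and "content".
    """
    content = turn["content"]
    return {
        "podcast": {
            "name": podcast["name"],
            "title": podcast["title"],
            "hosts": podcast.get("hosts", []),
            "participants": podcast.get("participants", []),
            "id": make_podcast_id(podcast["name"], podcast["title"]),
            "category": podcast.get("category", []),
        },
        "transcript": {
            "content": content,
            "participant": turn["participant"],
            "time": turn["time"],
            # Length is measured in characters here; words would work just as well.
            "length": len(content),
        },
        # Copy of transcript.time, as suggested above.
        "timestamp": turn["time"],
    }
```

Deriving the id in one place keeps it consistent across every document from the same episode, which is what makes it usable as a filter.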

For full-text search, the field holding the transcript, in this example transcript.content, needs to be mapped as type text. The other string fields could be mapped as keyword, the numeric fields length and duration as long, and the time field as date.
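
To make that concrete, here is a sketch of the mapping and of the two kinds of search you mentioned, written as plain Python dicts that could be sent as request bodies. The names MAPPING and phrase_query are hypothetical. One assumption to check: a date field with the default format expects full timestamps, so if your time values are offsets like 00:01:15 you would probably need a custom date format or a numeric field instead.

```python
# Index mapping following the field types described above.
MAPPING = {
    "mappings": {
        "properties": {
            "podcast": {
                "properties": {
                    "name": {"type": "keyword"},
                    "title": {"type": "keyword"},
                    "hosts": {"type": "keyword"},
                    "participants": {"type": "keyword"},
                    "id": {"type": "keyword"},
                    "category": {"type": "keyword"},
                }
            },
            "transcript": {
                "properties": {
                    "content": {"type": "text"},  # analyzed for full-text search
                    "participant": {"type": "keyword"},
                    "time": {"type": "date"},  # assumes full timestamps
                    "length": {"type": "long"},
                    "duration": {"type": "long"},
                }
            },
            "timestamp": {"type": "date"},
        }
    }
}


def phrase_query(phrase, podcast_id=None):
    """Build a search body for an exact phrase in transcript.content.

    With podcast_id set, results are restricted to that one episode;
    without it, the search runs across all transcripts.
    """
    query = {"bool": {"must": [{"match_phrase": {"transcript.content": phrase}}]}}
    if podcast_id is not None:
        # A term filter works here because podcast.id is mapped as keyword.
        query["bool"]["filter"] = [{"term": {"podcast.id": podcast_id}}]
    return {"query": query}
```

Swapping match_phrase for match would give a looser search that matches the individual words instead of the exact phrase.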

Does this help?
