Hackathon Project: Transcription using VCS and GStreamer

Sanchayan
edited January 2023 in Code Share

Summary

Going into the new year, I had not decided on a project for the hackathon. A colleague suggested integrating transcription into the existing VCS feature. While this has already been shared here and here, what remained to be seen was how well the transcription could be synced with the spoken audio when transcription is part of the recording by default.

I eventually decided to try out two approaches:

- Use existing transcription infrastructure and connect it to VCS

- Use a transcriber element as part of the GStreamer pipeline in the back end

Transcription using Daily's Video Component System (VCS)

It's already possible to overlay the transcript on Daily recordings, as shown here. The same approach is used, except that the transcription messages are now routed to VCS by default. In this approach, transcription runs as a separate service and is not part of the GStreamer services in the back end that are responsible for streaming and recording. Here are the results.


Transcription using GStreamer

In this approach, we rely on GStreamer running as part of the back end services. An existing GStreamer transcriber element is placed after the audio mixer and generates the transcription output. This output is then passed to the webvttenc element, which generates Web Video Text Tracks (WebVTT) output. The transcription output is plain text, and webvttenc generates timed text from it. WebVTT is a W3C standard for displaying timed text in connection with the HTML5 <track> element. While the first approach embeds the transcription in the recording, here the transcription is written to a separate .vtt file that standard players such as VLC or a browser can use.
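
To give a rough idea of what ends up in the .vtt file, here is a minimal Python sketch that serializes timed text as WebVTT. This is not how webvttenc is implemented; the Cue class and the format_timestamp and to_webvtt helpers are hypothetical names made up for this illustration, and the cue timings are invented.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One piece of timed text (hypothetical helper type)."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp of the form hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues: list[Cue]) -> str:
    """Serialize cues as a WebVTT document.

    The file starts with the "WEBVTT" signature; each cue is a timing
    line followed by its text, and blocks are separated by a blank line.
    """
    blocks = ["WEBVTT"]
    for cue in cues:
        timing = f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}"
        blocks.append(f"{timing}\n{cue.text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Invented example cues, only to show the shape of the output.
    cues = [
        Cue(0.5, 2.0, "Hello everyone"),
        Cue(2.0, 4.25, "welcome to the call"),
    ]
    print(to_webvtt(cues), end="")
```

How closely each cue's start and end times match the spoken audio is exactly the sync being compared between the two approaches.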

Note: Change the extension of the txt file to vtt before playing it with VLC.

Learnings

Setting aside the accuracy of the transcription itself, the two approaches show that better sync between the spoken audio and the transcription is achieved when transcription is part of the GStreamer media pipeline itself. To improve this further, partial results have to be handled, which would reduce latency and tighten the sync between transcription and audio.
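
One possible way to handle partial results is sketched below in Python. This was not implemented for the hackathon and is only an assumption about how it could work: the transcriber is assumed to deliver a growing word list per utterance together with a flag marking the final result, and the PartialResultHandler class and stable_prefix function are hypothetical names. Words are released as soon as two consecutive partial results agree on them, instead of waiting for the final result.

```python
def stable_prefix(previous: list[str], current: list[str]) -> list[str]:
    """Return the leading words on which two consecutive hypotheses agree."""
    prefix = []
    for old, new in zip(previous, current):
        if old != new:
            break
        prefix.append(new)
    return prefix


class PartialResultHandler:
    """Release words early, once they stop changing between partial results.

    Hypothetical sketch: assumes each result is the full word list of the
    current utterance and that is_final marks the end of that utterance.
    """

    def __init__(self) -> None:
        self.previous: list[str] = []  # last partial hypothesis seen
        self.emitted = 0               # words of this utterance already released

    def push(self, words: list[str], is_final: bool) -> list[str]:
        """Feed one result and return the words that can be shown now."""
        if is_final:
            stable = words
        else:
            stable = stable_prefix(self.previous, words)
        new_words = stable[self.emitted:]
        self.emitted = max(self.emitted, len(stable))
        if is_final:
            # The utterance is complete, so start fresh for the next one.
            self.previous = []
            self.emitted = 0
        else:
            self.previous = words
        return new_words


if __name__ == "__main__":
    handler = PartialResultHandler()
    # Invented sequence of partial results followed by a final result.
    for words, is_final in [
        (["hello"], False),
        (["hello", "word"], False),
        (["hello", "world", "again"], False),
        (["hello", "world", "again"], True),
    ]:
        print(handler.push(words, is_final))
```

The trade-off in this sketch is that a word released early is never corrected, even if the final result later disagrees with it, so lower latency comes at some cost in accuracy.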