Update from the W3C TPAC 2022, Part 2

vr000m Dailynista
edited October 2022 in Tech Updates

When there is a v1 API, there are always discussions of a next version (NV) API. WebRTC was born in 2010, and the world has changed since, especially in the way we build webapps.

Boldly, in 2022, we want multi-threaded applications that can deliver both low latency and large scale. We'd need to do this through low-level access to building blocks across the complete media pipeline, namely, capture, A/V processing (think machine learning), encode/decode, transport, and rendering.

This would require deconstructing the v1 media pipeline, which is an ongoing discussion. For historical context, the original webrtc-pc API was even more opaque than what we use today and did not have several objects that we now use and love. Namely, the RtpSender/RtpReceiver objects were introduced in 2016, five years after the initial work, and these objects gave us some level of low-level control. We can now attach tracks via code, as opposed to describing them via the Session Description Protocol (SDP). A good mental model of the pipeline's evolution is described by Jan-Ivar in the attached screenshot.

Now vs Past

In 2022, we have a lot more building blocks compared to 2011 or even 2016. For example (a list of newer things):

  • In addition to vanilla RTP, we now have new transport protocols and delivery mechanisms: QUIC, HTTP/3, WebTransport, Media over QUIC (MoQ), and DASH.
  • We have new codecs from ITU-T and AoMedia.
  • At the W3C/WHATWG, we have new paradigms such as WHATWG Streams, which is somewhat like flow-based programming.
  • We also have new media formats via WebCodecs, Media Source Extensions (MSE) v2, and Encrypted Media Extensions (EME) for DRM.
  • We have the evolution of webrtc via RTCPeerConnection, RTCDataChannel, media-capture, screen-capture, mediacapture-transform, and encoded-transform.
  • We have newer rendering via Canvas, WebGL, WebGPU, etc.
  • Lastly, we now have WASM.
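
To make the Streams point concrete, here is a minimal sketch of that flow-based style: a source piped through a transform, the way a real pipeline might chain capture → processing → encode stages. The doubling transform is purely illustrative, a stand-in for whatever processing stage an app would plug in.

```typescript
// A tiny WHATWG Streams pipeline: source -> transform -> sink.
async function runPipeline(values: number[]): Promise<number[]> {
  // Source stage: enqueue each value, then close the stream.
  const source = new ReadableStream<number>({
    start(controller) {
      for (const v of values) controller.enqueue(v);
      controller.close();
    },
  });

  // Processing stage: a stand-in for a real A/V transform.
  const double = new TransformStream<number, number>({
    transform(chunk, controller) {
      controller.enqueue(chunk * 2);
    },
  });

  // Sink stage: drain the piped stream into an array.
  const out: number[] = [];
  const reader = source.pipeThrough(double).getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    out.push(value as number);
  }
  return out;
}
```

The same shape applies when the chunks are VideoFrames instead of numbers, which is what mediacapture-transform exposes.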

Picking a few topics from the meeting that I think were interesting

Pluggable codecs -- there are several new codecs in the market. For video, Google's VP9 and AOMedia's AV1 update VP8, while HEVC/H.265 and VVC/H.266 update H.264. For audio, Google and Microsoft have been working on the Lyra and Satin codecs, which would essentially replace Opus in some low-bitrate (5-25 kbps) scenarios.

Opus, H.264, and VP8 are mandatory to implement, but to get these newer codecs, we'd have to get all the browser vendors to implement them, and that is a tad difficult. Thus, none of these codecs is currently available unless you're building a native application, where you can add any codec you want (though both users would need to be on the native app for the codec to be selected). Hence, there is a strong desire to make these newer codecs available via wasm, and to have these wasm libraries extend webcodecs, which webrtc could then use.
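
Until a codec is pluggable, an app at least needs to know whether the browser it is running in can encode it. A rough sketch of that feature detection via WebCodecs (the AV1 codec string is illustrative; `VideoEncoder` only exists where WebCodecs ships):

```typescript
// Feature-detect a codec via WebCodecs before offering it in negotiation.
async function canEncode(codec: string): Promise<boolean> {
  // VideoEncoder is a browser global; absent in non-WebCodecs environments.
  const VE = (globalThis as any).VideoEncoder;
  if (typeof VE === "undefined") return false;
  const { supported } = await VE.isConfigSupported({
    codec,          // e.g. "av01.0.04M.08" for an AV1 profile
    width: 640,
    height: 480,
  });
  return supported === true;
}
```

An app could run this at startup and only include the codec in its offer when it returns true.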

The primary work here is to make webcodecs and webrtc work together: for example, bridging RTCEncodedAudioFrame with EncodedAudioChunk, and RTCEncodedVideoFrame with EncodedVideoChunk. In addition, the data in webrtc is mutable, while in webcodecs it is not.
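
That mutability mismatch means a bridge between the two has to copy. A hedged sketch of the shape such glue might take — `ChunkInit` and `toChunkInit` are hypothetical names, not part of either spec; the real EncodedVideoChunk constructor copies its `data` on construction:

```typescript
// Illustrative init object mirroring what an EncodedVideoChunk wants.
interface ChunkInit {
  type: "key" | "delta";
  timestamp: number; // microseconds
  data: Uint8Array;
}

// Bridge a mutable encoded-transform payload toward an immutable chunk.
function toChunkInit(
  payload: ArrayBuffer, // e.g. an RTCEncodedVideoFrame's data
  isKeyFrame: boolean,
  timestampUs: number
): ChunkInit {
  // Copy, so later in-place edits to the webrtc frame cannot alias the chunk.
  const data = new Uint8Array(payload.slice(0));
  return { type: isKeyFrame ? "key" : "delta", timestamp: timestampUs, data };
}
```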

RtpTransport -- Another aspect to consider: does the webcodec provide an encoded stream, and does the application need to packetize the webcodec output into RTP, or should the webcodec spew out RTP packets instead? If the webcodec does not produce the RTP payload format, then the browser would need to do that. Unfortunately, we do not have a good way of packetizing newer codecs unless they are backwards compatible with an existing payload format that the browsers have already implemented. Ideally, we would have an RtpTransport, so the webcodec could spew out the encoded bytestream and the RtpTransport could packetize it and push it into the RtpSender, which would send it over the network/peerconnection.
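
To show what "packetize into RTP" involves, here is a sketch that prepends the fixed 12-byte RTP header (version, marker, payload type, sequence number, timestamp, SSRC, per RFC 3550) to an encoded payload. This is illustrative glue an application might write today, not the RtpTransport API itself, which is still under discussion.

```typescript
interface RtpOptions {
  payloadType: number; // 7-bit dynamic PT negotiated for the codec
  seq: number;         // 16-bit sequence number
  timestamp: number;   // 32-bit media timestamp
  ssrc: number;        // 32-bit synchronization source id
  marker?: boolean;    // e.g. set on the last packet of a video frame
}

// Build one RTP packet: fixed 12-byte header followed by the payload.
function packetize(payload: Uint8Array, opts: RtpOptions): Uint8Array {
  const header = new Uint8Array(12);
  const view = new DataView(header.buffer);
  header[0] = 0x80; // version 2, no padding, no extension, zero CSRCs
  header[1] = ((opts.marker ? 1 : 0) << 7) | (opts.payloadType & 0x7f);
  view.setUint16(2, opts.seq & 0xffff);
  view.setUint32(4, opts.timestamp >>> 0);
  view.setUint32(8, opts.ssrc >>> 0);

  const packet = new Uint8Array(12 + payload.length);
  packet.set(header);
  packet.set(payload, 12);
  return packet;
}
```

The hard part an RtpTransport would take off developers' hands is not this header but the codec-specific payload format (fragmenting frames across packets, aggregation rules, and so on).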

To summarise, there are two main focus areas for the next version of webrtc:

  • new capabilities, especially with the advent of client-side machine learning, where developers envision building moderation features using gaze detection and face/object detection. The spec webrtc-nv-usecases covers a slew of ideas: internet of things (iot), VR/AR, speech- and video-related machine learning, performance improvements by moving to workers, etc.
  • deconstruction of the webrtc-pc API, i.e., giving developers more control over the media pipeline, for example, over how to introduce new codecs (VP8 and H.264 were mandated; how does one get new codecs incorporated?).

Happy to hear your thoughts on this evolution.
