pion play-from-disk example: understanding timing of writing audio frames to track

In the course of working on app+services for a product at work, I'm getting into WebRTC. I have a backend service in golang that produces audio to be sent to clients, and for this I started from this pion play-from-disk sample and adapted it. The sample reads 20 ms pages of audio and writes them to the audio track, every 20 ms.

This feels extremely fragile to me, especially in the context of this service, where I could imagine a single host managing potentially hundreds of these connections and periodically hitting some CPU contention (though there are knobs I can turn to reduce this risk).

Here is a simplified version of the example, with an audio file preloaded into these 20 ms opus frames, just playing on a loop. This sounds pretty good, but there is an occasional hitch in the audio that I don't yet understand. I tried shortening the ticker to 19 ms, and that might actually slightly improve the sound quality (fewer hitches), but I'm not sure. If I tighten it too much, I hear the audio occasionally speeding up. If I loosen it, there is more hitch/stutter in the audio.

How should this type of thing be handled? What are the tolerances for writing to the track? I assume this is being written to an underlying buffer… How much can we pile in there to make sure it doesn't starve?

oggPageDuration := 20 * time.Millisecond
for {
    // wait 1 second before restarting/looping
    time.Sleep(1 * time.Second)

    ticker := time.NewTicker(oggPageDuration)
    for i := 0; i < len(pages); i++ {
        if oggErr := audioTrack.WriteSample(media.Sample{Data: pages[i], Duration: oggPageDuration}); oggErr != nil {
            panic(oggErr)
        }
        <-ticker.C
    }
    ticker.Stop() // stop the ticker so each pass through the outer loop doesn't leak one
}
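
For reference, one variant I've been considering (just a sketch; it assumes WriteSample stamps media time from the Duration field rather than from the wall clock, which seems to be how TrackLocalStaticSample behaves, so writing a little ahead shouldn't be heard as speedup): pace against cumulative media time and keep a small fixed lead over the wall clock, so one late wakeup doesn't translate directly into a starved receiver.

lead := 2 * oggPageDuration // cushion of pre-written audio; a guess, not a tuned value
start := time.Now()
var sent time.Duration // media time written so far
for _, page := range pages {
    // Sleep only once we are more than `lead` ahead of real time.
    if ahead := sent - time.Since(start); ahead > lead {
        time.Sleep(ahead - lead)
    }
    if err := audioTrack.WriteSample(media.Sample{Data: page, Duration: oggPageDuration}); err != nil {
        panic(err)
    }
    sent += oggPageDuration
}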

Comments

  • arun (Dailynista)

    Hey @smittyplusplus, I'm not super familiar with Pion, but I can share a bit of general knowledge around this.

    Your intuition here is right — you're sending data at the requisite 20ms interval, and there is buffering to avoid starvation, but the rabbit hole goes a bit deeper.

    On one side you have a sender sending audio data in real time (either an audio input device that periodically wakes up and provides samples, or in your case a CPU timer that wakes at some interval and provides an encoded audio frame). On the other side, you have a receiver that is trying to play that data to its audio hardware (which wakes up periodically and asks for data).

    There are two problems at hand:

    • Skew: Due to the physical nature of clock crystals, if you have two independent clocks, they will run at slightly different speeds — so your sender and receiver have slightly different notions of what "20ms" means (the CPU and audio clock might be driven by different crystals, so you might have skew even on the same device; some rough numbers below)
    • Jitter: There are some variable delays in audio data travelling from your sender to your receiver — this might be due to network jitter (packets might be delayed by varying amounts based on wifi quality, other traffic at intermediate routers, …) or things like scheduling jitter (which you allude to)
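
    For a rough sense of scale (back-of-the-envelope, not a measured figure): crystal oscillators are commonly specified to within a few tens of ppm. At 100ppm of skew, each 20ms frame is off by about 2µs, which compounds to a full frame's worth of drift roughly every 200 seconds. So even on a perfect network, a receiver that never compensates would glitch every few minutes.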

    So effectively, what your receiver is trying to do is (a) estimate the skew between the sender and the receiver, and (b) minimise the effects of jitter (on the skew calculation, as well as avoiding starvation).

    The receiver typically does this by maintaining a "jitterbuffer" that tries to smooth out the variability of arriving packets by correlating the timestamp on RTP packets (which should roughly estimate when the sender "captured" the data) with the actual arrival time of the packet (which effectively includes all the jitter elements). This might be done via a filter + linear regression, for example.
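
    As a toy illustration of that regression (nothing Pion-specific, and real jitter buffers update incrementally and filter outliers rather than batch-fitting like this):

    type obs struct {
        rtpSecs     float64 // RTP timestamp divided by the clock rate
        arrivalSecs float64 // local wall-clock arrival time
    }

    // estimateSkew fits arrival time against sender media time; a slope of
    // 1.001 means one "sender second" takes 1.001 of our seconds, i.e. the
    // sender's clock runs ~0.1% slow relative to ours.
    func estimateSkew(points []obs) float64 {
        n := float64(len(points))
        var sumX, sumY, sumXY, sumXX float64
        for _, p := range points {
            sumX += p.rtpSecs
            sumY += p.arrivalSecs
            sumXY += p.rtpSecs * p.arrivalSecs
            sumXX += p.rtpSecs * p.rtpSecs
        }
        // Ordinary least squares slope: cov(x, y) / var(x).
        return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
    }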

    Finally, on the basis of the skew, you now need to match the rate at which your receiver needs audio data to fill its device buffer with the rate at which the sender is sending. You might do this by adding gaps (sender is slower than receiver) or overlapping samples (sender is faster than receiver), or you might resample to try to match the two based on your skew estimate. The latter is what browsers (via libwebrtc) typically do.
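
    And a crude version of the gap/overlap bookkeeping, driven by that slope estimate (play and frames are hypothetical stand-ins for the playout path; as above, browsers resample instead, which sounds far better):

    slope := estimateSkew(points) // from the sketch above
    frameDur := 0.020             // 20ms of audio per frame, in seconds
    drift := 0.0                  // accumulated mismatch between sender and receiver clocks
    for _, f := range frames {
        drift += (slope - 1) * frameDur
        switch {
        case drift >= frameDur:
            play(f) // sender is slower than us: repeat a frame so the device doesn't starve
            play(f)
            drift -= frameDur
        case drift <= -frameDur:
            drift += frameDur // sender is faster than us: drop this frame to shed the backlog
        default:
            play(f)
        }
    }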

    In addition to this there is some buffering in the audio subsystem (and possibly in the WebRTC stack as well) to deal with starvation due to processing on the receiver side. Audio threads also generally get real time priority since audio glitches make for a poorer user experience than a dropped video frame, for example.

    So to answer your question, if the jitter is "bad", you might occasionally hear dropouts or some speeding up/slowing down of audio, but if you're sending 20ms of data every 20ms within reasonable tolerance, the receiver should be able to adapt.

  • Thank you @arun, this was very informative, and has given me a lot of specific details to look into when I start trying to improve/optimize this in the near future.

    One quick follow-up on this: on the producer side, would it make sense to write slightly more frequently until the buffer is full-ish (assuming the APIs I'm using let me monitor that), and then back off to give more tolerance? Or would that inadvertently cause the end user to hear speedup?

  • arun (Dailynista)

    On the producer side, you might not get much feedback on buffer fullness, so the usual approach is just to send out data at a consistent cadence. There may be a little send-side buffering (accumulating data for the encoder, and the network buffer while sending out data), but usually you're trying to get the data out to the receiver with minimum delay.

    It's easier to think of this in terms of reading from an audio capture device — you don't (usually) get to choose the rate at which data becomes available, you just send each chunk out in real time as it arrives. If there are issues sending the data out (for example network congestion), you might adjust the encode bitrate, but this is more common for video, which takes much more bandwidth.
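
    In Pion terms that shape might look something like this (captureFrames is a hypothetical channel fed by a capture callback, one 20ms opus frame at a time; no timer at all, the device's clock does the pacing):

    for frame := range captureFrames {
        if err := audioTrack.WriteSample(media.Sample{Data: frame, Duration: 20 * time.Millisecond}); err != nil {
            break // in a long-lived service you'd log and recover rather than panic
        }
    }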

    The receiver is where you do the buffering (for network and scheduling jitter), rate matching (for skew), etc.

  • Nice, thanks again.