WebRTC – Go beyond the APIs Part 1: Signaling and Connecting
We were in the middle of the pandemic. Many of us had to get accustomed to the new norm: staying at home to work and study, separated and physically disconnected from our relatives, families, and coworkers. Out of this demand for connecting socially over the Internet, many real-time communication technologies have emerged. One of the most prominent technologies powering video calls and conferencing is WebRTC – an open-source project that enables real-time video/audio communication. With WebRTC, you can exchange data and media with others directly through a peer-to-peer connection without installing any intermediary plugin. It is used by many great applications, such as Discord and Google Meet, to name a few.
A lot goes on before two peers are fully connected and can talk to each other directly.
First, let us briefly talk about the concept of a peer-to-peer (P2P) network. Traditionally, our device acts as a client that sends requests to a server; the server handles each request and sends back a response. With P2P, however, each computer plays both roles – client and server – simultaneously.
For now, let’s dive into how computers get connected and communicate with each other over a P2P network.
Initially, peers – the so-called WebRTC agents – have no idea whom to connect to or how. Signaling is the bootstrapping process that makes a call possible. During this process, the peers exchange the basic information needed to start the call: transport addresses (IP address, port number, and the protocol being used), media types, and other session data.
However, the signaling mechanism isn’t specified by WebRTC itself. You can freely choose any protocol as long as the peers are connected once it finishes. In practice, though, WebSocket is the most common choice for transferring signaling information.
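Because WebRTC treats the signaling payloads as opaque, a signaling server can be as simple as a message relay. Here is a minimal sketch of that idea as an in-memory hub (the `SignalingHub` name and API are invented for illustration; a real deployment would use WebSocket connections instead of callbacks):

```typescript
// Hypothetical sketch: a tiny in-memory signaling "hub" that relays opaque
// messages (SDP offers/answers, candidates) between peers. The hub never
// inspects the payload – that is exactly why WebRTC can leave the
// signaling protocol unspecified.

type Handler = (msg: string) => void;

class SignalingHub {
  private peers = new Map<string, Handler>();

  // A peer registers with an id and a callback for incoming messages.
  register(id: string, onMessage: Handler): void {
    this.peers.set(id, onMessage);
  }

  // Relay a message to the target peer, if it is registered.
  send(to: string, msg: string): void {
    const handler = this.peers.get(to);
    if (handler) handler(msg);
  }
}

// Usage: Alice sends an (opaque) offer to Bob through the hub.
const hub = new SignalingHub();
const bobInbox: string[] = [];
hub.register("bob", (msg) => bobInbox.push(msg));
hub.send("bob", "v=0 ... (SDP offer)");
```

In a real application the `register`/`send` calls would be carried over WebSocket, but the relay-only role of the server stays the same.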
To do that, WebRTC uses an existing protocol called the Session Description Protocol (SDP). Through this protocol, two peers share all the state required for establishing the connection, such as the transport addresses to use, the media types, and information on how to secure the connection – all of which both WebRTC agents have to agree upon.
The Session Description Protocol is defined in RFC 4566. SDP is a text-based protocol in which each line contains a key-value pair. Reading SDP can be intimidating at first: an SDP packet is just a sequence of key=value lines. WebRTC doesn’t use every keyword that SDP defines – only some of them. Let’s look at the keywords and values we are most likely to encounter while reading an SDP packet:
v: version, it should be 0
o: origin, a unique ID which is useful for the renegotiation process
s: session name, the name of the session
t: timing, should be equal 0 0
m: media description, described below
c: connection data, should be equal IN IP4 0.0.0.0
a: attribute information, the most common line in WebRTC SDP
And here is an example of an SDP packet in WebRTC:
o=- 0 0 IN IP4 127.0.0.1
s=-
c=IN IP4 127.0.0.1
m=audio 4000 RTP/AVP 111
m=video 4002 RTP/AVP 96
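Since every SDP line is just `<key>=<value>`, reading one mechanically is straightforward. A minimal sketch (the `parseSdp` helper is ours, not a WebRTC API):

```typescript
// Split an SDP string into its key/value lines. Real parsers also track
// which media section each line belongs to; this sketch only shows the
// basic "one key=value pair per line" structure.

interface SdpLine {
  key: string;
  value: string;
}

function parseSdp(sdp: string): SdpLine[] {
  return sdp
    .split(/\r?\n/)
    .filter((line) => line.includes("="))
    .map((line) => {
      const idx = line.indexOf("=");
      return { key: line.slice(0, idx), value: line.slice(idx + 1) };
    });
}

// The example packet from above:
const example = [
  "o=- 0 0 IN IP4 127.0.0.1",
  "s=-",
  "c=IN IP4 127.0.0.1",
  "m=audio 4000 RTP/AVP 111",
].join("\r\n");

const lines = parseSdp(example);
```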
Let’s take a closer look at the media description here and its attribute:
m=audio 4000 RTP/AVP 111
a=rtpmap:111 OPUS/48000/2
This media description contains a list of media formats, which will eventually be used as the payload type in RTP packets (RTP is the protocol for transferring media; more on RTP and its payload type later). The last value of this media description, 111, is the payload type, and it is mapped to a specific codec by the attribute right below it: this attribute maps payload type 111 to the codec named Opus.
The second media description in this example holds the video information; its payload type is 96, which is mapped to the VP8 codec.
By the way, a codec is the piece of software (or hardware) that encodes and decodes media streams and signals.
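Collecting those payload-type-to-codec mappings is a small exercise in reading `a=rtpmap` lines. A sketch (the `codecMap` function name is ours):

```typescript
// Build a payload-type → codec-name map from the a=rtpmap lines of an SDP.
// The attribute format is: a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]

function codecMap(sdp: string): Map<number, string> {
  const map = new Map<number, string>();
  for (const line of sdp.split(/\r?\n/)) {
    const m = line.match(/^a=rtpmap:(\d+) ([^/]+)\//);
    if (m) map.set(Number(m[1]), m[2]);
  }
  return map;
}

// Using the audio media description from the example above:
const sdpFragment = "m=audio 4000 RTP/AVP 111\r\na=rtpmap:111 OPUS/48000/2";
const codecs = codecMap(sdpFragment); // maps 111 → "OPUS"
```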
OK, so how is all of this SDP information used in WebRTC? It turns out that WebRTC uses the offer/answer model (if you look at the APIs, you will also see something similar to this). For example, suppose Alice and Bob want to talk to each other using WebRTC. First, Alice creates an SDP packet and sends it as an offer to Bob. Bob receives this offer and decides whether to accept or reject it by sending an SDP packet back to Alice as an answer. If both Alice and Bob agree upon the media, codec, transport address, and other information provided by the other side, they can start communicating.
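The essence of offer/answer can be modeled in a few lines. This is a conceptual sketch only – the real negotiation happens through `createOffer`/`createAnswer` on an `RTCPeerConnection` and covers far more than codecs – but it shows the shape of the exchange (all names here are invented):

```typescript
// Conceptual model of offer/answer: the offerer lists the codecs it can
// use, the answerer keeps only the ones it supports too, and the call can
// proceed if anything remains in the intersection.

interface Offer {
  codecs: string[];
}

interface Answer {
  codecs: string[];
}

function createAnswer(offer: Offer, supported: string[]): Answer {
  return { codecs: offer.codecs.filter((c) => supported.includes(c)) };
}

// Alice offers three codecs; Bob supports a partially overlapping set.
const aliceOffer: Offer = { codecs: ["OPUS", "G722", "VP8"] };
const bobAnswer = createAnswer(aliceOffer, ["OPUS", "VP8", "H264"]);
const canTalk = bobAnswer.codecs.length > 0; // they agree on OPUS and VP8
```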
I have mentioned “some other information” quite a bit when talking about SDP, so let’s now cover those values in greater detail. For example, look at this complete example of the SDP used in WebRTC:
v=0
o=- 3546004397921447048 1596742744 IN IP4 0.0.0.0
s=-
t=0 0
a=fingerprint:sha-256 0F:74:31:25:CB:A2:13:EC:28:6F:6D:2C:61:FF:5D:C2:BC:B9:DB:3D:98:14:8D:1A:BB:EA:33:0C:A4:60:A8:8E
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
a=setup:active
a=mid:0
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=ssrc:350842737 cname:yvKPspsHcYcwGFTw
a=ssrc:350842737 msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=ssrc:350842737 mslabel:yvKPspsHcYcwGFTw
a=ssrc:350842737 label:DfQnKjQQuwceLFdV
a=msid:yvKPspsHcYcwGFTw DfQnKjQQuwceLFdV
a=sendrecv
a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 2 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 1 udp 1694498815 126.96.36.199 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=candidate:foundation 2 udp 1694498815 188.8.131.52 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=end-of-candidates
m=video 9 UDP/TLS/RTP/SAVPF 96
c=IN IP4 0.0.0.0
a=setup:active
a=mid:1
a=ice-ufrag:CsxzEWmoKpJyscFj
a=ice-pwd:mktpbhgREmjEwUFSIJyPINPUhgDqJlSd
a=rtcp-mux
a=rtcp-rsize
a=rtpmap:96 VP8/90000
a=ssrc:2180035812 cname:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=ssrc:2180035812 mslabel:XHbOTNRFnLtesHwJ
a=ssrc:2180035812 label:JgtwEhBWNEiOnhuW
a=msid:XHbOTNRFnLtesHwJ JgtwEhBWNEiOnhuW
a=sendrecv
a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 2 udp 2130706431 192.168.1.1 53165 typ host generation 0
a=candidate:foundation 1 udp 1694498815 184.108.40.206 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=candidate:foundation 2 udp 1694498815 220.127.116.11 57336 typ srflx raddr 0.0.0.0 rport 57336 generation 0
a=end-of-candidates
It can look scary, but luckily we have already studied all of these keywords; what we need now is to understand their corresponding values. Let’s start with this line:
a=fingerprint:sha-256 0F:74:31:25:CB:A2:13:EC:28:6F:6D:2C:61:FF:5D:C2:BC:B9:DB:3D:98:14:8D:1A:BB:EA:33:0C:A4:60:A8:8E
This attribute gives the hash of a peer’s certificate used in the DTLS handshake, which secures the call (more on this later).
a=group:BUNDLE 0 1
group:BUNDLE means that multiple media streams will all travel together over a single connection. For example, suppose Alice sends Bob an SDP offer like this:
m=audio 9 UDP/TLS/RTP/SAVPF 111
m=video 9 UDP/TLS/RTP/SAVPF 96
Both the audio and video streams in these media descriptions will share a single port number; when there is inbound media traffic, Alice listens for both audio and video on that one port.
The value after setup determines which peer acts as the client and which as the server during the DTLS handshake (this handshake happens once the connection between the two peers is established – after the ICE process, more on this later):
active: this peer acts as the DTLS client
passive: this peer acts as the DTLS server
actpass: ask the other peer to choose which is which
The ice-ufrag value is the user fragment of an ICE agent, and ice-pwd is the ICE password; together they are used to authenticate ICE traffic.
The rtpmap attribute – we already saw this one – maps a payload type in the media description to a specific codec.
The fmtp attribute specifies additional parameters for a payload type of the media description.
The a=sendrecv line determines the direction of the media transceiver. There are a few different directions a transceiver can have:
sendrecv: this transceiver both sends and receives media
sendonly: this transceiver only sends media and doesn't want to receive anything
recvonly: this transceiver only receives media
inactive: this transceiver doesn't intend to send or receive any media
(A transceiver is a concept from the WebRTC API; each media description in the SDP corresponds to one transceiver.)
a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host
Next, we have the candidate attribute; it specifies a transport address at which one peer can receive traffic and media from the other. Because one peer can have multiple transport addresses (e.g., a public IP address, a relayed address obtained from a TURN server, or its local address), we can see multiple candidates in an SDP (some of these candidates are discovered through a STUN server, described later).
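The fields of a candidate line sit at fixed positions, so extracting the interesting ones is mechanical. A sketch (the `parseCandidate` helper is ours; real ICE stacks handle many more extension fields):

```typescript
// Pull the main fields out of an SDP candidate attribute:
// a=candidate:<foundation> <component> <proto> <priority> <ip> <port> typ <type> ...

interface Candidate {
  foundation: string;
  component: number; // 1 = RTP, 2 = RTCP
  protocol: string;
  priority: number;
  ip: string;
  port: number;
  type: string; // host, srflx (server-reflexive), relay, ...
}

function parseCandidate(line: string): Candidate {
  const parts = line.replace(/^a=candidate:/, "").split(" ");
  return {
    foundation: parts[0],
    component: Number(parts[1]),
    protocol: parts[2],
    priority: Number(parts[3]),
    ip: parts[4],
    port: Number(parts[5]),
    type: parts[7], // parts[6] is the literal "typ" token
  };
}

// The host candidate from the example SDP above:
const cand = parseCandidate(
  "a=candidate:foundation 1 udp 2130706431 192.168.1.1 53165 typ host generation 0"
);
```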
ssrc is the synchronization source number, which identifies a single media track, while the mslabel property defines the unique ID of a container (a media stream) that can hold multiple tracks.
After the signaling process, the peers have shared the information necessary to initiate the call; now we need a way to actually connect them. While exchanging SDP, each peer sends the other the candidates – transport addresses – to which it wishes the other peer to route traffic. But first, where do these candidates come from?
Candidates can come from multiple sources. Some are derived from a physical or logical network interface; others are discovered through STUN or TURN servers. Before we move on to the concepts of STUN and TURN, let’s first revisit some concepts of NAT.
Each IPv4 address is 32 bits, so in total we have roughly 4 billion unique addresses. That sounds like a lot, but in reality they have been running out. NAT (Network Address Translation) was created to mitigate this problem: instead of each computer having its own public IP address, all the computers behind a NAT on a local network share one public IP address, and each of them is assigned a private IP address such as 192.168.1.2 (class C) or 172.16.0.2 (class B). NAT does the work of mapping these multiple private local addresses onto the single public one. For example, when a computer A inside a local network with the IP 192.168.2.2 sends an outgoing request to an external network (to computer B), the NAT creates an entry in its mapping table that maps A’s private address to a public one (for example 18.104.22.168); that public address is then used as the source IP address of the traffic sent to computer B. When computer B sends traffic back to A, the NAT maps the public IP address back to A’s private IP address. There are different types of NAT mapping (e.g., full-cone, restricted-cone, port-restricted, and symmetric).
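The mapping-table idea can be made concrete with a toy model. This sketch assumes the simplest, endpoint-independent kind of mapping (one public port per internal address, regardless of destination); the `Nat` class and the 203.0.113.7 documentation-range IP are invented for illustration:

```typescript
// Toy model of a NAT mapping table: outgoing traffic allocates (or reuses)
// a public port for each internal ip:port; incoming traffic on that public
// port is translated back to the internal host.

class Nat {
  private nextPort = 40000;
  private outMap = new Map<string, number>(); // "privateIp:port" → public port
  private inMap = new Map<number, string>();  // public port → "privateIp:port"

  constructor(public publicIp: string) {}

  // Outgoing packet: look up or create the mapping for this source.
  translateOut(privateIp: string, privatePort: number): { ip: string; port: number } {
    const key = `${privateIp}:${privatePort}`;
    let port = this.outMap.get(key);
    if (port === undefined) {
      port = this.nextPort++;
      this.outMap.set(key, port);
      this.inMap.set(port, key);
    }
    return { ip: this.publicIp, port };
  }

  // Incoming packet: find which internal host owns this public port.
  translateIn(publicPort: number): string | undefined {
    return this.inMap.get(publicPort);
  }
}

// Computer A (192.168.2.2) sends outbound traffic through the NAT.
const nat = new Nat("203.0.113.7");
const mapped = nat.translateOut("192.168.2.2", 5000);
```

Symmetric NAT – the kind that breaks direct P2P connections, as we will see below – would instead create a *different* mapping per destination, which is exactly why the address a STUN server observes may not work for another peer.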
The problem with NAT is that two peers in different networks cannot establish a connection knowing only each other’s private IP addresses. That’s where a STUN (Session Traversal Utilities for NAT) server comes in: a peer behind NAT sends a request to the STUN server, and the STUN server replies with the public IP address of that peer as observed from outside. Once the peers’ public IP addresses are obtained, they can be used as ICE candidates in the signaling process.
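On the wire, the request a peer sends to a STUN server is tiny. As a sketch of the format (RFC 5389), here is the fixed 20-byte header of a Binding Request; sending it over UDP and parsing the response are left out:

```typescript
// Build the 20-byte header of a STUN Binding Request: message type 0x0001,
// length 0 (no attributes), the fixed magic cookie 0x2112A442, and a
// 96-bit transaction ID chosen by the client.

function stunBindingRequest(transactionId: Uint8Array): Uint8Array {
  if (transactionId.length !== 12) {
    throw new Error("transaction ID must be 12 bytes");
  }
  const buf = new Uint8Array(20);
  const view = new DataView(buf.buffer);
  view.setUint16(0, 0x0001);     // message type: Binding Request
  view.setUint16(2, 0);          // message length: no attributes
  view.setUint32(4, 0x2112a442); // magic cookie (fixed by the spec)
  buf.set(transactionId, 8);     // transaction ID
  return buf;
}

// A fixed transaction ID just for the example; real clients use random bytes.
const txId = new Uint8Array(12).fill(7);
const packet = stunBindingRequest(txId);
```

The server's Binding Response carries the peer's public address in an XOR-MAPPED-ADDRESS attribute, which is the value that ends up as an srflx candidate in the SDP.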
Sometimes, even when we have the peers’ public addresses from a STUN server, we still cannot establish a P2P connection, for reasons such as UDP traffic being forbidden or a peer sitting behind a symmetric NAT (where a different mapping is used for each destination). When a direct connection is not possible, a TURN (Traversal Using Relays around NAT) server can be used as a relay: instead of sending traffic directly, each peer sends its UDP/TCP traffic to the public TURN server, and the TURN server “relays” the traffic to the destination peer. Conceptually, a peer first sends an “Allocate” request to the server, indicating that it requires resources to establish a connection with its peer; if the allocation is possible, the TURN server responds with “Allocation Successful”, which contains the relay transport address allocated on the TURN server for the requesting peer.