Friday, February 13, 2009

Robust Video Server Series

Today's internet is teeming with video content. As content expands, the need for scalable, high-performance video servers will also expand. This series addresses the fundamental components of a robust media server from the ground up. Please subscribe to keep up with the posts! Over the next few months I'll be writing on each of the building blocks necessary for a complete implementation:
  1. RTSP
  2. NAT
  3. STUN
  4. ICE
  5. RTSP NAT
  6. RTP/RTCP
  7. MPEG
  8. Error Correction
  9. Distributed Systems
  10. and more...
The series will conclude with one or more culminating posts, describing how all these basic components are strung together into a full-fledged video server. Enjoy!

Sunday, February 8, 2009

STUN: NAT Traversal

As a software developer, how do you get around the limitations of NAT? STUN - Session Traversal Utilities for NAT (RFC5389) - provides a simple tool for just this purpose. STUN servers and clients are commonplace today, but how do they work? Why do we use them? And what does the average developer need to know?

Prerequisite: NAT Primer or a basic understanding of NAT

Given a decent understanding of NAT, we know that a client behind a NAT device cannot receive inbound packets unless a binding has been established. In the context of UDP, bindings are created by first sending an outbound UDP packet from client to server. This initial packet sets up some state on the NAT device thus allowing subsequent traffic from server to client to proceed unhindered. STUN Binding requests serve this very purpose and more.

For example, suppose a video player client wishes to receive a stream of video packets. The client has somehow learned the video server's source IP and port number (perhaps via RTSP). Two problems must be resolved before video can flow from server to client:
  1. Open a binding on the NAT device
  2. Inform the server of the client's public IP address
Step one is a piece of cake. The client could simply send an empty UDP packet to the server's address. Unfortunately, this one-way transaction doesn't offer enough information to complete step two.
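As a rough illustration of step one, here is a minimal Python sketch that sends an empty UDP datagram and shows what the receiving side observes. It assumes both ends run on the loopback interface so it can be tried without a real NAT; the function names are made up for this example. Behind a real NAT, the observed source would be the NAT's public address and mapped port rather than the client's own.

```python
import socket

def send_empty_datagram(server_addr, local_port=0):
    """Send a zero-length UDP datagram and return the socket used to send it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", local_port))  # port 0 lets the OS pick a free port
    sock.sendto(b"", server_addr)       # an empty payload is still a valid datagram
    return sock

def demo():
    # Stand-in for the video server: a UDP socket on loopback (assumption).
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    server.settimeout(2.0)

    client = send_empty_datagram(server.getsockname())
    payload, observed_source = server.recvfrom(2048)
    client_port = client.getsockname()[1]

    client.close()
    server.close()
    return payload, observed_source, client_port
```

Note that only the server learns the observed source address here; the client gets nothing back, which is exactly the gap that step two has to close.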

Step two is a little more difficult. The client needs to learn the dynamic mapping that the NAT device created for the first outgoing UDP packet. For example, the client may have sent an empty packet from its internal IP address (192.168.1.100:12345) to the server's public IP address (64.1.2.3:45678). Now, any incoming packets which reach the NAT on the port mapped for 192.168.1.100:12345 will be forwarded accordingly. So what is the mapped port? STUN provides the answer.

Using STUN accomplishes step one and easily facilitates step two. Here's how it works: First the video server listens for STUN packets on the port it will be using for video traffic. The client sends a STUN packet to that port on the video server. For the sake of this explanation, assume the STUN packet is just an empty UDP packet with a few attributes used to distinguish it (explained later). Once the server receives this STUN packet, it generates a response. In the response's body, the server will include the source IP and port of the original incoming packet. This address is actually the external IP and port on the NAT device, which were dynamically bound to the client's internal address and port earlier (by the first outgoing STUN packet). Once generated, the response is sent back to the NAT device, the NAT device forwards it to the client, and the client receives it. Finally, the client interprets the response's contents and learns its own external address. The client may then use this information in referrals to other services. For example, the video server can be told the address and port that it should use for sending video traffic.
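The exchange above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete STUN stack: it builds the 20-byte Binding request header defined in RFC5389, builds the server's success response, and extracts the mapped address on the client side. The function names are invented for this example. One detail worth knowing: RFC5389 responses normally carry the address in XOR-MAPPED-ADDRESS (the address XORed with the magic cookie), with plain MAPPED-ADDRESS kept for older clients, so the parser accepts both. Only IPv4 is handled.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442          # fixed value from RFC5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_MAPPED_ADDRESS = 0x0001
ATTR_XOR_MAPPED_ADDRESS = 0x0020
FAMILY_IPV4 = 0x01

def build_binding_request(transaction_id=None):
    """Client side: header only (type, body length 0, cookie, 12-byte transaction ID)."""
    tid = transaction_id or os.urandom(12)
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, tid), tid

def build_binding_response(tid, source_ip, source_port):
    """Server side: report the observed source address back to the client."""
    xport = source_port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(source_ip))[0] ^ MAGIC_COOKIE
    attr = struct.pack("!HHBBHI", ATTR_XOR_MAPPED_ADDRESS, 8, 0, FAMILY_IPV4, xport, xaddr)
    return struct.pack("!HHI12s", BINDING_SUCCESS, len(attr), MAGIC_COOKIE, tid) + attr

def parse_binding_response(data, tid):
    """Client side: return (ip, port) as seen by the server."""
    msg_type, length, cookie, got_tid = struct.unpack("!HHI12s", data[:20])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE or got_tid != tid:
        raise ValueError("not a matching Binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[offset:offset + 4])
        value = data[offset + 4:offset + 4 + attr_len]
        is_address = attr_type in (ATTR_MAPPED_ADDRESS, ATTR_XOR_MAPPED_ADDRESS)
        if is_address and attr_len >= 8 and value[1] == FAMILY_IPV4:
            port, addr = struct.unpack("!HI", value[2:8])
            if attr_type == ATTR_XOR_MAPPED_ADDRESS:
                port ^= MAGIC_COOKIE >> 16
                addr ^= MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    raise ValueError("no mapped address attribute found")
```

In a real client, the request bytes would be sent with sendto() from the same UDP socket that will later receive video, since the NAT binding belongs to that socket's local port.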

One benefit of STUN, over other NAT traversal utilities, is its ability to cope with a complex hierarchy of NAT devices. Protocols such as UPnP lack this ability.

STUN also offers security. Each STUN packet contains a number of type-length-value attributes. MESSAGE-INTEGRITY is one such attribute. It carries an HMAC-SHA1 of the packet: a keyed hash rather than encryption. The server knows the client's key by some other means (typically a shared password) and it may use that key to authenticate the inbound STUN packet.
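To make the authentication step concrete, here is a hedged Python sketch of appending and checking MESSAGE-INTEGRITY. It assumes the short-term credential style, where the key is simply the shared password bytes, and it assumes MESSAGE-INTEGRITY is the last attribute in the message (no FINGERPRINT after it). The function names are illustrative. Per RFC5389, the header's length field is adjusted to include the 24-byte attribute before the HMAC is computed.

```python
import hashlib
import hmac
import struct

ATTR_MESSAGE_INTEGRITY = 0x0008
MI_SIZE = 24  # 4-byte attribute header + 20-byte HMAC-SHA1 value

def add_message_integrity(message, key):
    """Return the message with a MESSAGE-INTEGRITY attribute appended."""
    msg_type, length = struct.unpack("!HH", message[:4])
    # The length field must already count the attribute when the HMAC is computed.
    adjusted = struct.pack("!HH", msg_type, length + MI_SIZE) + message[4:]
    digest = hmac.new(key, adjusted, hashlib.sha1).digest()
    return adjusted + struct.pack("!HH", ATTR_MESSAGE_INTEGRITY, 20) + digest

def verify_message_integrity(message, key):
    """Check a message whose final attribute is MESSAGE-INTEGRITY (assumption)."""
    if len(message) < 20 + MI_SIZE:
        return False
    covered, attr = message[:-MI_SIZE], message[-MI_SIZE:]
    attr_type, attr_len = struct.unpack("!HH", attr[:4])
    if attr_type != ATTR_MESSAGE_INTEGRITY or attr_len != 20:
        return False
    expected = hmac.new(key, covered, hashlib.sha1).digest()
    return hmac.compare_digest(expected, attr[4:])  # constant-time comparison
```

A server would call verify_message_integrity() with the key it holds for that client and drop the packet if the check fails.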

Other attributes include the following (refer to RFC5389 for a complete list):

MAPPED-ADDRESS: Returned by the server in its response to convey the observed source IP and port of the client.
MESSAGE-INTEGRITY: Explained above, used for authentication.
FINGERPRINT: Useful for distinguishing STUN packets from other packet types.
SOFTWARE: Textual description of the software application being used by the agents.
ALTERNATE-SERVER: Instructs client to use a different STUN server.
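As an example of how one of these attributes is produced, the sketch below appends and checks FINGERPRINT in Python. RFC5389 defines its value as the CRC-32 of the message up to the attribute, XORed with 0x5354554e; the standard library's zlib.crc32 is used here on the assumption that it implements the same CRC-32. The function names are illustrative, and FINGERPRINT is assumed to be the last attribute.

```python
import struct
import zlib

ATTR_FINGERPRINT = 0x8028
FINGERPRINT_XOR = 0x5354554E
FP_SIZE = 8  # 4-byte attribute header + 4-byte CRC value

def _fingerprint(covered):
    return (zlib.crc32(covered) & 0xFFFFFFFF) ^ FINGERPRINT_XOR

def add_fingerprint(message):
    """Return the message with a FINGERPRINT attribute appended."""
    msg_type, length = struct.unpack("!HH", message[:4])
    # As with MESSAGE-INTEGRITY, the length field counts the new attribute first.
    adjusted = struct.pack("!HH", msg_type, length + FP_SIZE) + message[4:]
    return adjusted + struct.pack("!HHI", ATTR_FINGERPRINT, 4, _fingerprint(adjusted))

def has_valid_fingerprint(message):
    """True if the final attribute is a FINGERPRINT that matches the message."""
    if len(message) < 20 + FP_SIZE:
        return False
    covered, attr = message[:-FP_SIZE], message[-FP_SIZE:]
    attr_type, attr_len, value = struct.unpack("!HHI", attr)
    return attr_type == ATTR_FINGERPRINT and attr_len == 4 and value == _fingerprint(covered)
```

This check is what lets a server that shares one port between STUN and video traffic decide cheaply whether an incoming datagram is STUN at all.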

The STUN RFC (5389) is short and certainly a worthwhile read. Note that STUN is not a complete NAT traversal solution. It is merely a NAT traversal "utility." In a future post, I will discuss ICE, Interactive Connectivity Establishment, which puts STUN to work in a complete NAT traversal solution.

If you're looking to get started with STUN and NAT traversal right away, but don't want to spend time implementing your own stack, check out the following:
  1. STUN Client and Server written in C++
  2. stun4j written in Java
  3. JSTUN written in Java
Note that most existing STUN implementations are RFC3489 compliant (which was replaced by RFC5389). If you know of a good RFC5389 implementation, please leave a link in the comments or send me an email and I will add it to this post.

Wednesday, February 4, 2009

NAT Primer

If you're reading this text, there's a good chance your computer is sitting behind a home gateway or router: a NAT device. NAT devices come in many names and flavors. Despite all that, even the minimally network-literate can configure one.

So why do we use them? How do they work? What does a software developer need to know?

Home gateways provide a great deal of functionality for such a small package:
  1. Switching
  2. Routing
  3. Wireless (802.11) Access
  4. NAT
Each of these features deserves an article in its own right but for today we'll just limit our attention to NAT. It may not be the most interesting, but it certainly poses the biggest problem to your average developer.

NAT stands for Network Address Translation. The concept of NAT was conceived as a temporary solution to rapid IPv4 address depletion (see RFC1631 from 1994).

So what is it? In short, NAT allows multiple network devices to share a single public IP address by providing each device with its own private IP address. This approach easily beat out other time-sharing approaches of the decade. To this day, NAT continues to stave off address depletion (just barely).

How does it work? Let's say you have a Linksys wireless router and at least half a brain. The router probably has one ethernet interface marked for the WAN (wide area network AKA the internet). Having at least half a brain, you plug in the ethernet from your modem (cable, 56K, fiber optic, whichever). The router is automatically assigned an IP address from your ISP (internet service provider), 64.1.2.3 for example. To the rest of the world, 64.1.2.3 is the IP for every device that connects to your home network.

On the LAN (local area network), the picture is a bit different. The router is responsible for serving up unique IP addresses to all its clients and even to itself. Each device has a distinct private address. IANA has reserved three ranges for just this purpose; we'll choose the block of class C networks: 192.168.0.0 - 192.168.255.255. You may recognize this address range from your own escapades into home networking and you may also recognize that a Linksys router will normally assign itself 192.168.1.1. Each other device is assigned a distinct address within that range. We'll assume your computer has connected to the router and you've been given IP address 192.168.1.100.

Great. Everyone has an IP address; the gateway (router) even has two to call its own. We'll refer to 192.168.1.1 as the gateway's private address and 64.1.2.3 as its public address. Let's talk about NAT.

Obviously, your 192.168.1.100 address can't be used on the public internet, since it's registered as a private address. NAT takes care of this problem transparently. Suppose, just for the example's sake, that you're interested in participating in a multiplayer game. Your computer sends out a UDP request to join the game. The outbound packet uses source address 192.168.1.100 and port number 12000. As the packet passes through the NAT (on your home router) the source address and port are translated on the fly. The NAT quickly chooses an external port number: 54000. The packet's source, as it's sent onto the public internet, is translated to 64.1.2.3:54000 instead of 192.168.1.100:12000. This translation process is NAT.

What happens if the server sends back a response? There's a bit of magic going on behind the scenes in your home router. When that first UDP packet was translated, some state was saved: a "NAT binding" was created dynamically. This binding is essentially a mapping from 192.168.1.100:12000 to 64.1.2.3:54000 that lasts at least 30 seconds if left unused. If traffic across this binding is active, it could remain open indefinitely. Let's say the game server sends back a response towards 64.1.2.3:54000. When the response reaches the home router, the router does a look-up of port 54000 in its internal bindings table and finds the mapping to 192.168.1.100:12000. With a mapping in place, the destination address and port are translated from 64.1.2.3:54000 to 192.168.1.100:12000 and the packet is sent on its merry way.
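The bindings table can be modeled with a toy Python class. This is a simplified sketch and not how any particular router is implemented: the class name, the sequential port allocation starting at 54000, and the choice to refresh the idle timer on inbound traffic as well as outbound are all assumptions made for illustration.

```python
import time

class NatTable:
    """Toy model of a NAT bindings table for UDP."""

    def __init__(self, public_ip, first_port=54000, idle_timeout=30.0, clock=time.monotonic):
        self.public_ip = public_ip
        self.next_port = first_port
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.by_private = {}      # (private_ip, private_port) -> public_port
        self.by_public_port = {}  # public_port -> ((private_ip, private_port), last_used)

    def _expire(self):
        """Drop bindings that have been idle longer than the timeout."""
        now = self.clock()
        for port, (private, last_used) in list(self.by_public_port.items()):
            if now - last_used > self.idle_timeout:
                del self.by_public_port[port]
                del self.by_private[private]

    def translate_outbound(self, private):
        """Rewrite a packet's source; creates the binding on first use."""
        self._expire()
        port = self.by_private.get(private)
        if port is None:
            port = self.next_port
            self.next_port += 1
            self.by_private[private] = port
        self.by_public_port[port] = (private, self.clock())
        return (self.public_ip, port)

    def translate_inbound(self, public_port):
        """Rewrite a packet's destination, or return None if no binding exists."""
        self._expire()
        entry = self.by_public_port.get(public_port)
        if entry is None:
            return None  # no binding: the packet is dropped
        private, _ = entry
        self.by_public_port[public_port] = (private, self.clock())
        return private
```

With the numbers from the example, translate_outbound(("192.168.1.100", 12000)) yields ("64.1.2.3", 54000), and a later translate_inbound(54000) maps the reply back to the private address until the binding goes idle.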

I should also note, for completeness, that some NATs filter inbound traffic by source address and port. So although public port 54000 is mapped to 192.168.1.100:12000, if the inbound traffic is coming from the wrong place it won't be forwarded. For example, in the strictest form of a NAT filter (symmetric), the router will only allow inbound packets with the same source address and port as the initial outbound packet's destination to pass through. In our example above, only packets with the game server's address and port would be allowed. Other types of NAT are less restrictive, but as a software developer, you should plan for the worst.
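The difference between filtering policies fits in a single Python function. The policy names below follow the terminology of RFC4787 rather than this post, and the function and its arguments are hypothetical. The strict case described above corresponds to the "address-and-port-dependent" branch.

```python
def allow_inbound(policy, contacted, source):
    """Decide whether an inbound packet may use an existing binding.

    contacted: set of (ip, port) remote endpoints the internal host has
               already sent packets to through this binding.
    source:    (ip, port) the inbound packet claims to come from.
    """
    src_ip, _src_port = source
    if policy == "endpoint-independent":
        return True  # anyone who knows the public port gets through
    if policy == "address-dependent":
        return any(ip == src_ip for ip, _ in contacted)
    if policy == "address-and-port-dependent":
        return source in contacted  # strictest: exact address and port match
    raise ValueError("unknown filtering policy: " + policy)
```

For instance, if the client has only contacted the game server at one address and port, a packet from a different port on that same server passes the "address-dependent" check but fails the strictest one, which is why planning for the worst case matters.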

Wow, quite a bit of explanation for such a common device. Translation happens for every single packet sent out from a private network onto the public network. As you might imagine, packet processing is hardware accelerated for maximum performance.

So what's the issue? When there are just two agents involved (a client behind a NAT and a server on the public internet) this model works just fine. Big issues arise for two reasons:
  1. Referrals
  2. Peer-to-Peer
Referrals take place whenever a protocol is used to convey address/port information inside its message body. For example, when a video-on-demand session is set up, the client might tell the server that it wants to receive video traffic at address 192.168.1.100 on port 12000. However, the negotiation protocol is taking place on some other port so there's no binding in place yet for port 192.168.1.100:12000. On top of that, the server doesn't know the location of 192.168.1.100, since it's a private address.

Peer-to-peer exchanges are extremely common on the modern internet. Let's take Skype for example, a VoIP service. Skype's services do not route all calls through their own servers because it simply isn't scalable. Instead, each call participant announces its IP address and port for receiving audio packets. If one or more of these participants is behind a NAT, we run into trouble. (Well, Skype doesn't since their developers already solved this issue, but if you're writing your own Skype knock-off you will.)

Fixing these issues is a little beyond the scope of a "NAT Primer" and this post is already getting lengthy. Over the next few weeks I'll explain the different methods of dealing with NAT limitations. If you'd like to start brainstorming, consider that bindings are only created from outbound traffic and a binding needs to be in place for a private address to receive inbound traffic. If you're chomping at the bit to learn the solution, search for STUN, TURN, and ICE on your favorite engine (unless your favorite is cuil, then suck it up and go to google).

Sunday, February 1, 2009

RTSP

Ever wonder how real-time content is controlled? Me too.

One option is RTSP: the Real Time Streaming Protocol. This jewel of 1998 is a classic of the web boom era. I mean, come on, who doesn't like text-based syntax? Unfortunately, the cold truth is that RTSP is more like the crazy inbred cousin of HTTP than the prince of online video content it could have been. Video is everywhere today, RTSP is not. Why not? Well, let's take a look...

RTSP has eleven methods:
  1. SETUP
  2. TEARDOWN
  3. PLAY
  4. PAUSE
  5. RECORD
  6. ANNOUNCE
  7. DESCRIBE
  8. GET_PARAMETER
  9. SET_PARAMETER
  10. REDIRECT
  11. OPTIONS
Each method listed here can be sent between an RTSP server and client via either UDP or TCP.

To better understand this protocol, let's follow a typical session. Our user, Calvin, has his nice-looking RTSP client GUI ready to go, a blank address bar anxiously awaiting input. Calvin sits down and enters:

>> rtsp://192.168.1.150/spartacus.avi

What happens now? First, the client establishes a TCP connection with the RTSP server on port 554 (RTSP). The client also opens up a UDP socket to receive incoming video traffic. A SETUP request is sent:

SETUP rtsp://192.168.1.150/spartacus.avi RTSP/1.0\r\n
Cseq: 1\r\n
Transport: RTP/AVP/UDP; unicast; destination=64.2.3.2; client_port=32884\r\n
\r\n

If the server understands the message and recognizes the URL, it will return a SETUP response:

RTSP/1.0 200 OK\r\n
Cseq: 1\r\n
Session: 5748271\r\n
Transport: RTP/AVP/UDP; unicast; destination=64.2.3.2; client_port=32884\r\n
\r\n

The server has now established a state machine for this new RTSP session and a unique session identifier has been assigned (5748271). The SETUP response echoes back the connection sequence number and transport header. All-in-all the server is ready to start streaming content. (woohoo)
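Because RTSP is text-based, building a request and reading a response takes very little code. The Python sketch below is a minimal illustration with invented function names; it ignores message bodies and repeated headers. It writes the sequence header as "CSeq", the spelling used in RFC2326, and lowercases header names when parsing since they are case-insensitive.

```python
def build_rtsp_request(method, url, cseq, headers=None):
    """Serialize an RTSP/1.0 request with no body."""
    lines = ["%s %s RTSP/1.0" % (method, url), "CSeq: %d" % cseq]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

def parse_rtsp_response(raw):
    """Return (status_code, reason, headers) from a response with no body."""
    head = raw.decode("ascii").split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers
```

A client would write the request bytes to its TCP connection on port 554, read the reply, and keep headers["session"] from the SETUP response for every later request in the session.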

From the ready state, a client may issue any method. Some require the session identifier (such as PLAY, PAUSE, and RECORD) while others do not (such as DESCRIBE and OPTIONS). Suppose Calvin wants to start watching "spartacus.avi". As you may have guessed, a PLAY is in order:

PLAY * RTSP/1.0\r\n
Cseq: 2\r\n
Session: 5748271\r\n
Range: npt=0.0-\r\n
Scale: 4.0\r\n
\r\n

The client includes the session identifier provided in the SETUP response. The connection sequence number is incremented. A Range header specifies the actual video time range for which the stream should be transmitted. In Calvin's example above, the range is a "normal play time" format indicating that the video content should be played starting from the beginning. The Scale header indicates the content speed (not the bit rate). Calvin has chosen to fast forward at 4x speed.
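To see how Range and Scale interact, here is a small Python sketch. It handles only the simplest "npt" forms used in this post (seconds and the literal "now"), not the hh:mm:ss form that RFC2326 also allows, and the function names are made up for the example.

```python
def parse_npt_range(value):
    """Parse a Range header value such as 'npt=0.0-' into (start, end).

    Each side is a float number of seconds, the string 'now', or None if open-ended.
    """
    unit, _, span = value.strip().partition("=")
    if unit != "npt":
        raise ValueError("only npt ranges are handled in this sketch")
    start, _, end = span.partition("-")

    def to_seconds(text):
        if text == "":
            return None
        if text == "now":
            return "now"
        return float(text)

    return to_seconds(start), to_seconds(end)

def media_position(start_seconds, scale, wall_seconds):
    """Media time reached after playing for wall_seconds at the given Scale."""
    return start_seconds + scale * wall_seconds
```

With Calvin's request, parse_npt_range("npt=0.0-") gives a start of 0.0 with an open end, and at a Scale of 4.0 thirty seconds of fast forwarding covers two minutes of video.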

So while Calvin's looking for his favorite scene, let's discuss a few of RTSP's shortcomings thus far. Sure, we have the ability to set up, play, and pause content, but what's missing? How about NAT traversal, should that be part of RTSP? Or how does RTSP account for buffering? Is RTSP even a reasonable approach for clips shorter than 10 minutes?

In truth, RTSP's usefulness broke down in a world of short videos and high-memory clients. For that limited 1998 hardware model, the client was assumed to lack sufficient memory for buffering an entire piece of content. Content was assumed to last dozens of minutes, if not hours. Thus, storing content on a media server and controlling it remotely was a reasonable solution. When reality kicked in, clients had more than enough memory. Content, for the most part, lasted just a couple of minutes. The very problems RTSP attempted to solve no longer existed.

However! Over the last few years, long term content and low-memory embedded devices have re-emerged. Many content providers offer feature length video via the internet. Handheld devices have become video-capable. Some ISPs are starting to offer set top boxes with ethernet interfaces rather than coaxial. Some homes are re-adopting the terminal-mainframe model by keeping a single high-capacity media server along with several thin clients for viewing.

On top of all that, RTSP is born again.

Alright, Calvin glimpses his scene while he's fast forwarding. Time to pause:

PAUSE * RTSP/1.0\r\n
Cseq: 3\r\n
Session: 5748271\r\n
Range: npt=now-\r\n
\r\n

Aha, found that scene. Calvin issues another PLAY, now with a normal scale:

PLAY * RTSP/1.0\r\n
Cseq: 4\r\n
Session: 5748271\r\n
Range: npt=now-\r\n
Scale: 1.0\r\n
\r\n

Once Calvin has satisfied his cinematic cravings, he stops the video and allows the server to release the session:

TEARDOWN * RTSP/1.0\r\n
Cseq: 5\r\n
Session: 5748271\r\n
\r\n
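Calvin's whole session can be traced through the state machine that the server keeps for it. The Python sketch below is a simplified subset modeled on the state tables in RFC2326 (Init, Ready, Playing, Recording); it ignores error responses and methods that do not change state, and the names are chosen for this example.

```python
# (current state, method) -> next state, for state-changing methods only
TRANSITIONS = {
    ("Init", "SETUP"): "Ready",
    ("Ready", "SETUP"): "Ready",
    ("Ready", "PLAY"): "Playing",
    ("Ready", "RECORD"): "Recording",
    ("Ready", "TEARDOWN"): "Init",
    ("Playing", "PLAY"): "Playing",
    ("Playing", "PAUSE"): "Ready",
    ("Playing", "TEARDOWN"): "Init",
    ("Recording", "PAUSE"): "Ready",
    ("Recording", "TEARDOWN"): "Init",
}

def next_state(state, method):
    """Return the state after a successful request, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, method)]
    except KeyError:
        raise ValueError("%s is not valid in state %s" % (method, state)) from None

def run_session(methods, state="Init"):
    """Apply a sequence of methods and return the final state."""
    for method in methods:
        state = next_state(state, method)
    return state
```

Feeding in the five requests from this walk-through (SETUP, PLAY, PAUSE, PLAY, TEARDOWN) ends back in Init, which is the point where the server can release the session.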

Of course, there's a lot more to RTSP. If this article has piqued your interest, I suggest reading RFC2326 http://tools.ietf.org/html/rfc2326. It's a breeze, really. If your eyes are truly on the future, take a look at http://tools.ietf.org/html/draft-ietf-mmusic-rfc2326bis-19. RTSP 2.0 is still in draft, but it's on the move.

As devices continue to shrink, thin small-footprint video clients will become more and more prevalent. Need a fun project? Write a compact RTSP library... You may not have the next Apache Web Server, but who knows where your efforts may take you.

- John "I am Spartacus!" Calthrup