How Does VoIP Work?

This section provides a very high-level overview of Voice over IP (VoIP).

Many people have used a computer and a microphone to record a human voice or other sounds. The process involves sampling the sound picked up by the microphone at a very high rate (at least 8,000 times per second) and storing those "samples" in memory or in a file on the computer. Each sample is just a very tiny slice of the person's voice or other sound recorded by the computer. The computer can then play all of those samples back in order, so that the listener hears what was recorded.

VoIP is based on the same idea, but the difference is that the audio samples are not stored locally. Instead, they are sent over the IP network to another device and played there.

Of course, much more is required to make VoIP work. When recording the sound samples, the computer will certainly record only a limited frequency range and might compress the samples so that they require less space. There are a number of ways to compress audio; the algorithm that does so is referred to as a "compressor/de-compressor", or simply CODEC. Many CODECs exist for a variety of applications (e.g., movies and sound recordings). For VoIP, the CODECs are optimized for compressing voice, which significantly reduces the bandwidth used compared to an uncompressed audio stream. Speech CODECs are tuned to preserve spoken words in the frequency range of human speech.

Once the sound is recorded by the computer and compressed, the samples are collected together into larger chunks and placed into data packets for transmission over the IP network. This process is referred to as packetization. Generally, a single IP packet will contain 10 or more milliseconds of audio, with 20 or 30 milliseconds being most common.
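The packetization arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions that do not come from this guide: uncompressed 8,000 samples per second at one byte per sample (as with a G.711-style CODEC), and 40 bytes of IPv4, UDP and RTP headers on every packet.

```python
# Assumptions for illustration only: 8,000 samples/s, 1 byte per sample,
# and 40 bytes of headers (IPv4 20 + UDP 8 + RTP 12) on every packet.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1
HEADER_BYTES = 20 + 8 + 12


def payload_bytes(packet_ms: int) -> int:
    """Audio bytes carried by one packet holding packet_ms of sound."""
    return SAMPLE_RATE_HZ * packet_ms // 1000 * BYTES_PER_SAMPLE


def packets_per_second(packet_ms: int) -> float:
    """How many packets per second one direction of a call produces."""
    return 1000 / packet_ms


def ip_bandwidth_bps(packet_ms: int) -> float:
    """Bits per second at the IP layer for one direction of a call."""
    return (payload_bytes(packet_ms) + HEADER_BYTES) * 8 * packets_per_second(packet_ms)


if __name__ == "__main__":
    for ms in (10, 20, 30):
        print(ms, "ms:", payload_bytes(ms), "bytes of audio,",
              round(ip_bandwidth_bps(ms)), "bit/s")
```

Shorter packets lower the delay but spend a larger share of the bandwidth on headers, which is one reason 20 or 30 milliseconds is a common compromise.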

A good comparison is to think of a packet as a postcard sent via postal mail. A postcard contains just a limited amount of information. To deliver a very long message, one must send a lot of postcards. Of course, the post office might lose one or more postcards. One also has to assemble the received postcards in order, so some kind of mechanism must be used to properly organize the postcards, such as placing a sequence number on the bottom right corner. One can think of data packets in an IP network as postcards.
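The postcard analogy can be sketched as code. The example below is a minimal illustration, not how any particular phone does it: each packet is modeled as a hypothetical (sequence number, payload) pair, and sequence-number wraparound is ignored.

```python
def reassemble(packets):
    """Put packets back in order and report which sequence numbers are missing.

    packets: list of (sequence_number, payload) tuples in arrival order.
    Returns (payloads_in_order, missing_sequence_numbers).
    """
    ordered = sorted(packets, key=lambda p: p[0])
    received = {seq for seq, _ in ordered}
    first, last = ordered[0][0], ordered[-1][0]
    missing = [seq for seq in range(first, last + 1) if seq not in received]
    return [payload for _, payload in ordered], missing


if __name__ == "__main__":
    # Postcard 2 was lost, and postcard 3 arrived before postcard 1.
    arrived = [(3, "c"), (1, "a"), (4, "d")]
    print(reassemble(arrived))
```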

Packets are sometimes delayed, just as with the postcards sent through the post office. This is particularly problematic for VoIP systems, as a delay in delivering a voice packet means the information is too old to play. Such old packets are simply discarded, just as if they had never been received. This is acceptable to a certain degree, as long as the resulting gaps do not distort the sound. Too much delay will cause the sound to have less than desirable quality.

IP devices generally measure the packet delay and expect the delay to remain relatively constant, though delay can increase and decrease during the course of a conversation. Variation in delay is called jitter. Delay itself just means it takes longer for the voice spoken by the first person to be heard by the user on the far end. In general, good networks have an end-to-end delay of less than 100 ms, though delay up to 400 ms is considered acceptable (especially when using satellite systems). Jitter can result in choppy voice or temporary glitches, so VoIP devices implement jitter buffer algorithms to compensate for jitter. Essentially, this means that a certain number of packets are queued before play-out, and the queue length may be increased or decreased over time to reduce the number of discarded, late-arriving packets or to reduce "mouth to ear" delay. Such "adaptive jitter buffer" schemes are also used by a wide variety of devices that deal with variable delay.
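One common way to put a number on jitter is the running interarrival-jitter estimate described in the RTP specification (RFC 3550): compare how long each packet spent in transit with the previous packet, and smooth the absolute difference with a gain of 1/16. The sketch below assumes send and arrival times are already available in milliseconds; real devices work in RTP timestamp units, and the function name is made up for this example.

```python
def interarrival_jitter(send_ms, arrive_ms):
    """Running jitter estimate in the style of RFC 3550.

    send_ms and arrive_ms are equal-length lists of timestamps in milliseconds.
    """
    jitter = 0.0
    prev_transit = None
    for sent, arrived in zip(send_ms, arrive_ms):
        transit = arrived - sent
        if prev_transit is not None:
            # How much the transit time changed compared with the last packet.
            d = abs(transit - prev_transit)
            # Smooth the estimate so a single late packet does not dominate.
            jitter += (d - jitter) / 16
        prev_transit = transit
    return jitter


if __name__ == "__main__":
    sent = [0, 20, 40, 60]
    print(interarrival_jitter(sent, [50, 70, 90, 110]))  # constant delay
    print(interarrival_jitter(sent, [50, 75, 90, 110]))  # one packet 5 ms late
```

A constant delay, however large, produces zero jitter; only the variation counts.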

Jitter in Packet Voice Networks

Jitter is defined as a variation in the delay of received packets. At the sending side, packets are sent in a continuous stream with the packets spaced evenly apart. Due to network congestion, improper queuing, or configuration errors, this steady stream can become lumpy, or the delay between each packet can vary instead of remaining constant.

This diagram illustrates how a steady stream of packets is handled.

When an IP device receives a Real-time Transport Protocol (RTP) audio stream for Voice over IP (VoIP), it must compensate for the jitter that is encountered. The mechanism that handles this function is the playout delay buffer. The playout delay buffer must buffer these packets and then play them out in a steady stream to the digital signal processors (DSPs) to be converted back to an analog audio stream. The playout delay buffer is also sometimes referred to as the de-jitter buffer.

This diagram illustrates how jitter is handled.

If the jitter is so large that it causes packets to be received out of the range of this buffer, the out-of-range packets are discarded and dropouts are heard in the audio. For losses as small as one packet, the DSP interpolates what it thinks the audio should be and no problem is audible. When jitter exceeds what the DSP can do to make up for the missing packets, audio problems are heard.
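The effect of the buffer size can be shown with a small sketch of a fixed playout delay buffer. This is a simplified model and not the algorithm of any specific DSP: packets are hypothetical (sequence, send time, arrival time) tuples, the first packet sets the baseline transit time, and any packet arriving after its playout deadline is discarded.

```python
def playout(packets, buffer_ms):
    """Split packets into those played and those discarded as too late.

    packets: list of (seq, send_ms, arrive_ms) tuples, first packet first.
    buffer_ms: how much extra delay the playout buffer is allowed to absorb.
    """
    base_transit = packets[0][2] - packets[0][1]
    played, discarded = [], []
    for seq, send_ms, arrive_ms in packets:
        # Each packet is due for playout a fixed time after it was sent.
        deadline = send_ms + base_transit + buffer_ms
        if arrive_ms <= deadline:
            played.append(seq)
        else:
            discarded.append(seq)
    return played, discarded


if __name__ == "__main__":
    stream = [(1, 0, 50), (2, 20, 75), (3, 40, 150), (4, 60, 112)]
    print(playout(stream, buffer_ms=40))   # packet 3 arrives too late
    print(playout(stream, buffer_ms=80))   # a deeper buffer absorbs it
```

A deeper buffer loses fewer packets but adds the same amount to the "mouth to ear" delay, which is the trade-off an adaptive buffer tries to balance.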

This diagram illustrates how excessive jitter is handled.

Video works in much the same way as voice. Video information received through a camera is broken into small pieces, compressed with a CODEC, placed into small packets, and transmitted over the IP network. This is one reason why VoIP is promising as a new technology: adding video or other media is relatively simple. Of course, there are certain issues that must be considered that are unique to video (e.g., frame refresh and much higher bandwidth requirements), but the basic principles of VoIP equally apply to video telephony.

Of course there is much more to VoIP than just sending the audio/video packets over the Internet. There must also be an agreed protocol for how computers find each other and how information is exchanged in order to allow packets to ultimately flow between the communicating devices. There must also be an agreed format (called payload format) for the contents of the media packets.

VoIP is implemented in a variety of hardware devices, including IP phones, analog terminal adapters (ATAs), and gateways. In short, a large number of devices can enable VoIP communication, some of which allow one to use traditional telephone devices to interface with the IP networks.

In a well-performing network, VoIP calls should be as clear as or clearer than any other type of audio transmission. VoIP calls are pure digitized sound. Each audio packet contains the pure audio just exactly as it is spoken into the microphone.

High definition voice contains a wider range of frequencies than typical voice transmissions and will deliver surprisingly good audio that contains a richer sound than most toll quality calls.

VoIP Protocols

There are a number of protocols that may be employed in order to provide for VoIP communication services. In this section, we will focus on SIP since it is the protocol of choice for most devices now being deployed in the industry.

Virtually every VoIP device in the world uses a standard called the Real-time Transport Protocol (RTP) for transmitting audio and video packets between communicating computers. RTP is an open standard defined by the IETF in RFC 3550. RTP also addresses issues like packet order and provides mechanisms (via the RTP Control Protocol, RTCP) to help address delay and jitter.
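To make the packet-order mechanism concrete, the sketch below unpacks the 12-byte fixed RTP header laid out in RFC 3550, which carries the sequence number and timestamp a receiver uses for ordering and jitter handling. The class and function names are made up for this example, and header extensions and CSRC lists are not handled.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("packet is shorter than the 12-byte RTP header")
    # Network byte order: two single bytes, a 16-bit and two 32-bit fields.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


if __name__ == "__main__":
    # A hand-built packet: version 2, payload type 0, sequence 1234.
    sample = struct.pack("!BBHII", 0x80, 0, 1234, 160, 0xDEADBEEF) + b"\x00" * 160
    print(parse_rtp_header(sample))
```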

Before audio or video media can flow between two devices, various protocols must be employed to find the remote device and to negotiate the means by which media will flow between the two devices. The protocols that are central to this process are referred to as call-signaling protocols, the most popular of which is Session Initiation Protocol (SIP).

Advantages of SIP

SIP is a simple protocol (at least for machines). It is a text-based protocol designed to be easily read, which lends itself well to troubleshooting: you can see what is in the packets without having to decode a binary format.
SIP works very much like email, and the addressing is very similar.
SIP calls are pure digital voice. In its native world, there is no distortion, no delay and no echo.
Many different devices can interoperate in a network, providing a wide variety of choice for users that extends far beyond what any single vendor can provide.
Location of users is irrelevant, as long as they have access to broadband.
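Because SIP is plain text, a message can be picked apart with ordinary string handling. The sketch below parses an abbreviated, made-up INVITE; the addresses, tags and extension numbers are placeholders, and a real request carries more headers than shown here. Note the email-like addressing in the sip: URIs.

```python
# An abbreviated, hypothetical INVITE. All names and addresses are placeholders.
SAMPLE_INVITE = (
    "INVITE sip:201@pbx.example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:200@pbx.example.com>;tag=1928301774\r\n"
    "To: <sip:201@pbx.example.com>\r\n"
    "Call-ID: a84b4c76e66710@192.0.2.10\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_request(text):
    """Split a SIP request into method, URI, version, headers and body."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, uri, version, headers, body


if __name__ == "__main__":
    method, uri, version, headers, _ = parse_sip_request(SAMPLE_INVITE)
    print(method, uri, version)
    print(headers["from"], "->", headers["to"])
```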

Anatomy of a SIP Call

A SIP call has a signaling component and a voice component. The signaling path is different from the actual voice transmission path, and both are unique to each call. In the configuration described here, all call setup and teardown signaling is over port 5060, and all voice transmissions are on a range of ports from 10,000 to 20,000. The ports are virtual ports and are simply part of the protocol for communicating over TCP/IP.

Setting the Stage

Before a SIP call can be placed, there must be SIP endpoints that have the ability to “find” and “be found by” other SIP endpoints in order to make and receive calls. For the purposes of this training guide, we will limit our SIP examples to endpoints that register to an IPitomy IP PBX. While it is possible to call from one SIP endpoint to another directly using a peer-to-peer method, most calls are facilitated through an IP PBX or soft switch to make dialing simple and easy, just like a PSTN call. The big advantage of having an IP PBX is that the user does not have to dial an IP address to call another endpoint, and all of the call information can be stored for reporting and similar purposes. The IP PBX will handle routing all of the calls to their destination on the PSTN or to local and remote extensions. Users simply dial phone numbers and extension numbers, just as on a legacy PBX system. An IP PBX supports analog PSTN lines, T1/PRI lines, DIDs and SIP trunks.

In order to be part of the PBX database, SIP endpoints register with the PBX. Once they are registered, they have the ability to dial phone numbers and be called by other endpoints. The registration can be from phones on the Local Area Network (LAN) or from anywhere on the Internet.

Now that the phones are registered with the PBX, a call can be initiated or received. To start the call, the endpoint sends an INVITE to the server to ask the other endpoint if it is available to take a call. This takes place on the signaling port (port 5060). The INVITE normally carries a session description that tells the other side which address, port and CODECs the caller wants to use for the RTP (voice) session. If the other endpoint is ready to accept the call, it sends back a 200 OK response containing its own session description, and the initiating phone confirms with an ACK. The RTP session is then opened using the ports communicated in these messages.
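The port information for the voice session travels inside the SIP messages as a short text block in Session Description Protocol (SDP) format. The sketch below pulls the RTP address and port out of a made-up session description; the address and port are placeholders, and only the connection ("c=") and audio media ("m=audio") lines are examined.

```python
# A hypothetical session description such as a phone might offer in an INVITE.
SAMPLE_SDP = (
    "v=0\r\n"
    "o=- 0 0 IN IP4 192.0.2.10\r\n"
    "s=-\r\n"
    "c=IN IP4 192.0.2.10\r\n"
    "t=0 0\r\n"
    "m=audio 10000 RTP/AVP 0 8\r\n"
)


def audio_endpoint(sdp):
    """Return the (address, port) where this side wants to receive RTP audio."""
    address = None
    port = None
    for line in sdp.splitlines():
        if line.startswith("c="):
            # c=<network type> <address type> <address>
            address = line.split()[2]
        elif line.startswith("m=audio"):
            # m=audio <port> <transport> <payload types...>
            port = int(line.split()[1])
    return address, port


if __name__ == "__main__":
    print(audio_endpoint(SAMPLE_SDP))
```

If the advertised address is a private LAN address and the far end is on the Internet, the voice packets have nowhere to go, which is the root of many NAT-related audio problems.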

When the call is over, one of the endpoints sends a BYE message and the call is torn down. That is a pretty simple description of how SIP makes a call. When the call is on the LAN, it does not have to go through the router. The PBX will handle telling the endpoints which ports to use to connect the RTP (voice) stream.

When the call is to a remote phone, the PBX knows the phone is outside of the firewall. This is when the router needs to have ports configured for signaling and RTP traffic.

Signaling Port 5060

When the ports are properly configured, port 5060 is forwarded in the router to the PBX system's IP address on the LAN. This allows the PBX to send signals to the remote phones as well as receive requests from them.

RTP Ports 10,000 – 20,000 Port Range Forwarding

Once the call is set up using the signaling on port 5060, the RTP stream is set up using a range of ports that are forwarded to the PBX LAN IP address. Using port range forwarding in the router, the range of ports 10,000 – 20,000 is forwarded for this purpose.

Each call requires two ports for RTP; one for sending and one for receiving. SIP sets this up from the initiating phone. The ports in the router are open from the inside of the firewall. The phone on the far end receives the information on what ports to use in the SIP packets.

Local Phone Diagram

Network Address Translation – NAT

TCP/IP is the protocol suite for sending data on the Internet. It relies on unique IP addresses in order to get the proper data to the proper computer/device on the network. There are several different types and classes of IP address.

If you are reading this, you are most likely connected to the Internet and there's a very good chance that you are using Network Address Translation (NAT) right now!

The Internet has grown larger than anyone ever imagined it could be. Although its exact size is unknown, about 5 billion people around the world use the Internet today, equivalent to 63 percent of the world's total population. The number of Internet users continues to grow too, with the latest data indicating that the world's connected population grew by almost 200 million in the 12 months to April 2022.

So what does the size of the Internet have to do with NAT? Everything! For a computer to communicate with other computers and Web servers on the Internet, it must have an IP address. An IP address (IP stands for Internet Protocol) is a unique 32-bit number that identifies the location of your computer on a network. Basically it works just like your street address: a way to find out exactly where you are and deliver information to you.

When IP addressing first came out, everyone thought that there were plenty of addresses to cover any need. Theoretically, you could have 4,294,967,296 unique addresses (2^32). The actual number of available addresses is smaller (somewhere between 3.2 and 3.3 billion) because of the way that the addresses are separated into Classes and the need to set aside some of the addresses for multicasting, testing or other specific uses.
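The address arithmetic is easy to check with Python's standard ipaddress module, which can also tell whether an address belongs to a reserved private range of the kind used behind NAT. The addresses below are only examples.

```python
import ipaddress

# A 32-bit address allows 2^32 distinct values.
TOTAL_IPV4_ADDRESSES = 2 ** 32


def is_private(address: str) -> bool:
    """True if the address is in a range reserved for private networks."""
    return ipaddress.ip_address(address).is_private


if __name__ == "__main__":
    print(TOTAL_IPV4_ADDRESSES)
    # The same number, counted as every address in the network 0.0.0.0/0.
    print(ipaddress.ip_network("0.0.0.0/0").num_addresses)
    for addr in ("192.168.1.10", "10.0.0.5", "8.8.8.8"):
        print(addr, "private" if is_private(addr) else "public")
```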

With the explosion of the Internet and the increase in home networks and business networks, the number of available IP addresses is simply not enough. The obvious solution is to redesign the address format to allow for more possible addresses. That redesign (IPv6) exists, but it is taking many years to fully deploy because it requires modification of the entire infrastructure of the Internet.

NAT Diagram – One Public IP Address is used by many Devices/Users

Under the current IP addressing scenario (IPv4) there are a finite number of IP addresses available on the Internet. There are not enough IP addresses available for each device to have its own unique IP address. To solve this problem, routers have the ability to send data to devices through a Network Address Translation (NAT) process. This process allows a group of devices (like PCs and phones) to all share one Internet IP address. This process has stretched out the usefulness of the current IP address scheme until the next numbering scheme (IPv6) is fully deployed.

NAT works by the router passing data to devices because it is aware of the address of the specific devices on the local area network. The information you download to your PC comes directly to your PC because you have a unique internal IP address and a unique MAC ID.

When a device outside of the local area network wants to communicate from the Internet with a device on the LAN, it needs a path to guide it to the specific device (like the PBX). In the case of a remote IP phone, when the remote phone wants to make a call, it needs to send some packets to the PBX. In order to do that, the router needs to be instructed on where to send the IP phone packets. When port 5060 is forwarded to the PBX on the LAN, all traffic that comes in on port 5060 gets directed to the PBX.
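Port forwarding amounts to a lookup table inside the router: for a packet arriving from the Internet on a given port, which LAN device should receive it? The sketch below models that table for the ports discussed in this guide. The PBX address 192.168.1.50 is a made-up example, and a real router also tracks protocol (UDP or TCP) and connection state.

```python
# Hypothetical LAN address of the PBX, for illustration only.
PBX_LAN_IP = "192.168.1.50"

# Each rule: (first WAN port, last WAN port, LAN device that receives the traffic).
FORWARD_RULES = [
    (5060, 5060, PBX_LAN_IP),     # SIP signaling
    (10000, 20000, PBX_LAN_IP),   # RTP voice range
]


def forward_target(wan_port):
    """Return the LAN address that inbound traffic on wan_port is sent to.

    Returns None when no rule matches, in which case the router has no way
    to know which inside device the packet is meant for.
    """
    for first, last, target in FORWARD_RULES:
        if first <= wan_port <= last:
            return target
    return None


if __name__ == "__main__":
    for port in (5060, 15000, 80):
        print(port, "->", forward_target(port))
```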

Once the call is set up, the RTP traffic is directed to ports for sending and receiving. These ports are determined through instructions in the call setup SIP packets. If the port forwarding is not configured properly, the remote phone will not function properly. The most common symptom, “one way audio”, is almost always caused by improper configuration of the RTP ports in the router. Some routers support Application Layer Gateway (ALG) functionality. While this usually appears to be designed for SIP, it most often interferes with packet delivery and must be turned off.

It is easy to see how the RTP stream can be disrupted if the voice packets cannot reach the proper destination. Sometimes this is caused by the router configuration. Sometimes it can be the inability of the router to properly perform NAT functions. Some routers are simply not capable of NAT and therefore will not work with remote IP phones.

It is essential to be in a position to have port forwarding enabled for remote access for maintenance, remote phones and branch office connectivity. If a third party is in control of the router, it is in everyone’s best interest to have these ports forwarded and the ALG turned off and confirmed before the IP PBX is installed. Failure to have these ports forwarded will result in implementation delays and must be a consideration when proposing a price for the end customer.

IP Telephony over TCP/IP using the SIP protocol produces pure digitized sound. There are no functions inside the PBX to add sounds like “static”, echo, hiss or hum. All of these sounds, if present, are produced in the analog world or are the result of packet loss. In order to troubleshoot issues on a TCP/IP packetized network, it is necessary to look for the solutions in the most likely places.

If a customer complains of static, it is most often packet loss in an IP network or an analog entry point like a handset. To identify the source of the problem, first check the analog connections e.g. handset, handset cable etc. Try a known good handset and cord. If that doesn’t solve the problem, run a test for packet loss.

IP phones are intelligent devices and are not dependent on a circuit. It is easy to simply unplug the phone and plug it into another Ethernet connection. If that fixes the problem, plug a known good phone into the Ethernet connection of the phone that had issues. If a known good phone is plugged in to the Ethernet connection and exhibits the same problem, check the cables and connections for problems. Make sure the Ethernet cables are not draped over fluorescent lights or other devices that can induce distortion into the packet delivery process.

Implementing Quality of Service (QOS) is Critical in Your VoIP Installation

Implementing QOS has huge benefits for your VoIP application. Don’t underestimate the importance of setting this up properly. Proper configuration can save customers from a difficult experience as well as keep your support costs down.

What Does QOS Do?

QOS sets the priority for data packets on your LAN. The LAN has packets from a diverse set of applications all traveling through a limited amount of bandwidth. Voice occupies a very small portion of the bandwidth. Since the voice packets are delivered in a time sensitive manner, it is important that they do not get interrupted or delayed. If they do, the audio quality on the call can deteriorate to a noticeable degree.

Voice Packets vs. Regular Data Packets

Voice packets are distinguished from other data packets by a designation in the voice packet header. This allows the data switch to know how to prioritize the individual packets to avoid delaying voice packets. Networks always try to deliver data on a best-effort basis. If there is bandwidth available, the data switch will try to pass all of the packets through as soon as it gets them, using all of the available bandwidth. If this happens, the voice packets can be momentarily blocked by all of the other data. Even though this may only take a few seconds, it is enough of a delay to cause the phone call to experience audio interruptions, as packets are delivered too late to be used. By prioritizing the voice packets, you ensure that the voice is not interrupted by other LAN traffic. Since the voice is a very small percentage of total bandwidth, there is no noticeable effect on all of the other data packets.

An example would be 10 people on the LAN trying to download a 20 MB file at the same time. On a normal 100BASE-T network, that could completely block all other data traffic for a brief time. When the voice packets always take priority over the data packets, the voice is delivered without delay: the download traffic makes room for the voice packets, with little or no perceptible delay to the downloads.
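The numbers in this example can be worked through quickly. The sketch below assumes decimal megabytes, a link that delivers its full 100 Mbit/s, and roughly 80 kbit/s per direction for one voice call; that last figure is an assumption for an uncompressed 64 kbit/s CODEC with 20 ms packets and is not taken from this guide.

```python
LINK_MBPS = 100               # 100BASE-T link speed in Mbit/s
VOICE_CALL_BPS = 80_000       # assumed bandwidth of one call, one direction


def transfer_seconds(users: int, file_mb: float, link_mbps: float) -> float:
    """Seconds needed to move every user's file if the link is fully used."""
    total_megabits = users * file_mb * 8
    return total_megabits / link_mbps


def voice_share(call_bps: float, link_mbps: float) -> float:
    """Fraction of the link that one voice call occupies."""
    return call_bps / (link_mbps * 1_000_000)


if __name__ == "__main__":
    print("Downloads saturate the link for about",
          transfer_seconds(10, 20, LINK_MBPS), "seconds")
    print("One call uses about",
          round(voice_share(VOICE_CALL_BPS, LINK_MBPS) * 100, 2), "% of the link")
```

Under these assumptions the downloads can fill the link for several seconds, while a call needs well under one percent of it, which is why giving voice priority costs the downloads almost nothing.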

How Do I Set up QOS?

QOS is set up in the data switch. The IPitomy server will have settings that it uses to identify the data packets. These settings are set to CS3 by default. The data switch will need to be configured to give the highest possible priority to these data packets. Switches use a variety of QOS labels, so you will have to determine the scheme (specification) of the switch to be used. Since IPitomy uses the DSCP Class label, just match that label in the switch to the switch’s highest priority (this may be a digit or “Highest” as in the Netgear FS728TP). It’s important to confirm that no other devices on the network are using that Class ID. If any are, change them, or change the Class ID on the IPitomy PBX under PBX Setup/SIP/Advanced. The Class ID used for voice traffic must not be used by other, non-voice data devices.
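When matching labels between the PBX and a switch, it helps to know what CS3 looks like as a number, since some switches ask for a decimal DSCP value and others for the whole ToS byte. The sketch below does the conversion and, as an illustration only, shows how a program could mark its own UDP packets using the standard IP_TOS socket option; the function names are made up, and whether the marking is honored depends on the operating system and the network.

```python
import socket


def class_selector_dscp(n: int) -> int:
    """DSCP value for Class Selector n (CS0..CS7): the class in the top 3 bits."""
    if not 0 <= n <= 7:
        raise ValueError("class selector must be between 0 and 7")
    return n << 3


def dscp_to_tos(dscp: int) -> int:
    """The DSCP occupies the upper six bits of the IP ToS byte."""
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


if __name__ == "__main__":
    cs3 = class_selector_dscp(3)
    print("CS3 DSCP:", cs3, "binary", format(cs3, "06b"))
    print("ToS byte:", dscp_to_tos(cs3), hex(dscp_to_tos(cs3)))
```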

Note: QOS can only be set on the LAN (in the data switches). It is not relevant on the WAN (Internet), since that traffic is routed through “hops” over which you have no control. The exception to this is private WANs like MPLS, where the network provider may be able to configure QOS point-to-point.

VoIP (RTP) works best when QOS is set on the LAN. As a rule, always implement QOS.

For more information on setting up QOS, see the article in IPitomy’s WIKI.
