Performance of Host-Based Media Processing
The viability of host-based software in providing media processing capabilities continues to expand the possibilities for voice application development. Over time, we will see a larger percentage of applications deployed on host-based rather than board-based systems. This trend is driven primarily by the relentless increase in processor speed and capabilities. Host-based media processing is also a natural fit for VoIP-based applications, since these applications can now be delivered as software-only solutions.
Vendors of host-based media processing implementations face a number of technical challenges in providing a viable solution to telephony application developers. The main challenges are:
- System Capacity: optimizing media processing algorithms to take advantage of the computing power now available on standard desktop platforms.
- Robustness: preventing other applications running on the platform and the media processing from interfering with each other when competing for CPU resources.
- Latency: ensuring that the latency introduced by the media processing does not affect the perceived quality of the audio signal being processed.
Below, we discuss each of these challenges in detail and describe how PIKA’s host-based media processing implementation meets them.
Defining System Capacity
The capacity of a system is the maximum number of active media processing channels (such as play, record, DTMF detection, and echo cancellation) that can be supported by the application on a platform. The greater the capacity of a single platform, the lower the cost per port for an application and the greater the value the application provides to a customer. To increase the capacity of an application, the media processing must be as efficient as possible.
The factors that affect the system capacity of host-based media processing applications on a platform are:
- the acceptable percentage of CPU capacity that can be dedicated to media processing
- the types of media processing that are performed on each active channel
- the specifications (CPU speed and architecture, operating system, amount of RAM, cache size, and NIC speed) of the platform running the application
PIKA has performed extensive performance tests with a variety of media processing applications executed on a wide range of platform configurations. Figure 1 lists a representative sample of applications that PIKA has benchmarked and the media processing that was performed on each channel for the duration of the test.
Application | Media Processing Performed |
Gateway | VoIP interface (using G.711 or G.729 codec); echo cancellation (12ms tail length); PSTN interface |
IVR | VoIP interface (using G.711 or G.729 codec) or PSTN interface; play on all channels for 70% of the call duration; DTMF detection |
Conference | VoIP interface (using G.711 or G.729 codec) or PSTN interface; conference; DTMF detection |
Figure 1: Benchmark Applications Media Processing Functions
High-Level API
VoIP Density
This test measures CPU usage for the specified number of simultaneous RTP channels. The CPU usage was measured after all calls were set up. The test initiator places an outgoing call to the test receiver system. The receiving system answers the call and places an outgoing call back to the test initiator. DTMF detection and tone detection are enabled throughout the call. For the duration of the call, the test initiator plays an announcement on the outgoing channel and records the announcement on the incoming channel. Call duration is 5 minutes. The audio quality of the recording is verified through PESQ scoring.
The following table lists supported channels for the following PCs:
- Test initiator (play/record): Intel Core 2 Duo E6600, 2.4 GHz, 2 GB RAM, 4 MB cache
- Test receiver (switching): Intel Core 2 Duo E6600, 2.4 GHz, 2 GB RAM, 4 MB cache
Linux (CentOS 5.4)
Application | Channels | Average CPU% |
G.711a Density (play / record) | 400 | 43 |
G.711a Density (switching) | 600 | 47 |
G.729 Density (play / record) | 160 | 56 |
G.729 Density (switching) | 160 | 51 |
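As a rough illustration of how these figures relate to capacity planning, the sketch below derives the average CPU cost per channel from the table and projects how many channels would fit in a given CPU budget. The linear-scaling assumption and the function names are illustrative only; they are not part of PIKA's benchmark method, and real systems may not scale linearly near saturation.

```python
import math

# (channels, average CPU %) pairs copied from the VoIP density table above.
MEASURED = {
    "G.711a play/record": (400, 43),
    "G.711a switching": (600, 47),
    "G.729 play/record": (160, 56),
    "G.729 switching": (160, 51),
}


def cpu_per_channel(channels: int, cpu_percent: float) -> float:
    """Average CPU percentage consumed by a single channel."""
    return cpu_percent / channels


def projected_channels(channels: int, cpu_percent: float, budget_percent: float) -> int:
    """Channels that fit in budget_percent of the CPU.

    ASSUMPTION: CPU cost grows linearly with channel count.
    """
    return math.floor(budget_percent * channels / cpu_percent)


for name, (channels, cpu) in MEASURED.items():
    per_channel = cpu_per_channel(channels, cpu)
    at_60 = projected_channels(channels, cpu, 60)
    print(f"{name}: {per_channel:.4f}% per channel, about {at_60} channels at 60% CPU")
```

The 60% budget used in the loop is only an example value; any target percentage can be substituted.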
ISDN Density
The test initiator places an outgoing call to the test receiver system. The receiving system answers the call and places an outgoing call back to the test initiator. DTMF detection and tone detection are enabled throughout the call. For the duration of the call, the test initiator plays an announcement on the outgoing channel and records the announcement on the incoming channel. Call duration is 5 minutes. The audio quality of the recording is verified through PESQ scoring.
The following table lists supported channels for the following PCs:
- Test initiator (play/record): Intel Core 2 Duo E6400 processor, 2.13 GHz, with 1 GB RAM
- Test receiver (switching): AMD Athlon64 X2 4800+ AM2 Dual-Core 65W Processor, 1 GB RAM
Linux (CentOS 5.4)
Application | Channels | Average CPU% |
E1 ISDN Density (play / record) | 480 | 27 |
E1 ISDN Density (switching) | 600 | 31 |
Windows
Application | Channels | Average CPU% |
E1 ISDN Density (play / record) | 480 | 28 |
E1 ISDN Density (switching) | 600 | 32 |
Gateway and IVR
- IVR
- A TDM channel (ISDN E1) answers the incoming call. DTMF detection and tone detection are enabled throughout the call. All channels perform audio playback for the duration of the call. Call duration is 30 seconds. The audio quality of the recording is verified through PESQ scoring.
- Gateway
- A TDM channel (ISDN E1) answers the incoming call. Based on the calling number, a SIP call is made and the underlying RTP channel (G.711) is connected to the TDM channel. DTMF detection and tone detection are enabled throughout the call. Echo cancellation is applied to all TDM channels (64ms tail length). For the duration of the call, an announcement is played on the incoming channel and recorded on the outgoing channel. Call duration is 5 minutes. The audio quality of the recording is verified through PESQ scoring.
The following table lists supported channels for the following PC:
- Intel Core 2 Duo E6400 processor, 2.13 GHz, with 1 GB RAM
Linux (CentOS 5.4)
Application | Channels | Average CPU% |
ISDN E1 IVR | 160 | 24 |
ISDN E1 to G.711a Gateway | 160 | 57 |
Note: Perceptual Evaluation of Speech Quality (PESQ) is a standard for evaluating audio quality in telephony applications. It is standardized as ITU-T Recommendation P.862 (02/01).
Low-Level API
The following table lists supported channels for the following PC at 60% CPU utilization: Intel Quad Core Xeon Processor E53452, 2.33 GHz, 2 GB RAM, 4 MB L2 Cache, 1 Gbps NIC.
Application | Number of Channels (Windows) | Number of Channels (SUSE 10.2) |
G.711 Gateway | 400 | 480 |
G.729 Gateway | 180 | 300 |
G.711 Channel | 1200 | 1400 |
G.729 Channel | 320 | 350 |
V.17 Fax | 480 | 600 |
G.711 Advanced Conference | 400 | 250 |
TDM Advanced Conference | 360 | 400 |
Mixed G.711 / G.729 IVR | 320 | 400 |
The following table lists supported channels for the following PC running at 60% CPU utilization: Intel Dual Xeon Nocona, 2×3.0 GHz, 1 GB RAM, 2 MB L2 Cache, 1 Gbps NIC.
Application | Number of Channels (Windows) | Number of Channels (SUSE 10.2) |
G.711 Gateway | 260 | 260 |
G.729 Gateway | 120 | 120 |
G.711 Channel | 800 | 850 |
G.729 Channel | 160 | 160 |
V.17 Fax | 360 | 400 |
G.711 Advanced Conference | 250 | 250 |
TDM Advanced Conference | 240 | 240 |
Details on Applications
- G.711 Gateway
- A TDM channel (ISDN E1) answers the incoming call. Based on the calling number, a SIP call is made and the underlying RTP channel (G.711) is connected to the TDM channel. Echo cancellation is applied to all TDM channels (12ms tail length). Call duration is 180 seconds.
- G.729 Gateway
- A TDM channel (ISDN E1) answers the incoming call. Based on the calling number, a SIP call is made and the underlying RTP channel (G.729) is connected to the TDM channel. Echo cancellation is applied to all TDM channels (12ms tail length). Call duration is 180 seconds.
- G.711 Channel
- An RTP channel (SIP G.711) answers the incoming call. DTMF detection is turned on throughout the call and an announcement is played to the call originator for 70% of the call duration. Call duration is 30 seconds.
- G.729 Channel
- An RTP channel (SIP G.729) answers the incoming call. DTMF detection is turned on throughout the call and an announcement is played to the call originator for 70% of the call duration. Call duration is 30 seconds.
- V.17 Fax
- A V.17 channel (ISDN E1) answers the incoming call. FAX reception is turned on and a fax is received. Call duration is 60 seconds.
- G.711 Advanced Conference
- An RTP channel (SIP G.711) answers the incoming call and connects it to an available 3-party advanced conference. Digit detection and echo cancellation (12ms tail length) are enabled. The conference is configured for two active talkers with automatic gain control and digit clamping enabled.
- TDM Advanced Conference
- A TDM channel (ISDN E1) answers the incoming call and connects it to an available 3-party advanced conference. Digit detection and echo cancellation (12ms tail length) are enabled. The conference is configured for two active talkers with automatic gain control and digit clamping enabled.
- Mixed G.711/G.729 IVR
- An RTP channel (SIP 25% G.729, 75% G.711) answers the incoming call. Tone detection, speech detection, and DTMF detection are turned on throughout the call. One quarter of the channels are performing audio playback and record.
Robustness of Host-Based Processing
When implementing a host-based media processing solution there are two key robustness objectives:
1. Host-based media processing must receive sufficient CPU capacity to perform all the required functions in real-time.
2. Other applications running on the same platform must regularly receive sufficient CPU capacity to perform all their required activities without noticeable deterioration of performance or noticeable pauses in execution.
Ideally, the CPU would be partitioned so that the host-based media processing receives a specific proportion of the CPU capacity and other applications receive the rest. For example, on a 3.2 GHz processor, the host-based media processing could be guaranteed 25% of the CPU capacity, and other applications would perform as if they were running on a dedicated 2.4 GHz platform.
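The arithmetic behind this example can be written out as a short sketch. It assumes, as the example does, that usable capacity scales linearly with the share of the clock left over; the function name is illustrative.

```python
def effective_clock_ghz(clock_ghz: float, media_share: float) -> float:
    """Clock speed effectively left for other applications.

    ASSUMPTION: capacity scales linearly with the unreserved CPU share.
    """
    if not 0.0 <= media_share <= 1.0:
        raise ValueError("media_share must be between 0 and 1")
    return clock_ghz * (1.0 - media_share)


# 25% of a 3.2 GHz processor reserved for media processing.
print(round(effective_clock_ghz(3.2, 0.25), 2))
```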
Why is CPU partitioning important? Without strong partitioning between host-based media processing and application CPU utilization, a number of problems arise. If the host-based media processing does not regularly receive sufficient CPU capacity, it cannot process all the media in real time. The quality of the audio deteriorates and sections of the audio may be dropped, causing distortion and choppiness in the audio signal. If other applications do not regularly receive sufficient CPU capacity, their performance becomes slow, choppy, and unresponsive. In extreme cases, the platform may not respond to mouse movements or keystrokes.
Windows and Linux are not real-time operating systems and, as such, are not designed to easily partition CPU usage. They have no built-in mechanisms to ensure that a process does not monopolize the CPU to the detriment of other processes.
PIKA’s solution is a real-time microkernel that acts as a firewall between applications and media processing. The microkernel allows non-real-time operating systems to serve the real-time demands of processing voice media without allowing the processor to be monopolized. The PIKA host-based media processing implementation partitions the available CPU capacity so that in every tick, the host-based media processing function execution is guaranteed to be allotted sufficient CPU capacity in real-time while ensuring that other processes running on the same platform receive sufficient CPU capacity to run smoothly and provide good performance.
PIKA’s microkernel is designed to work on single-processor platforms and to balance the load across CPUs on hyper-threaded or multiprocessor platforms.
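The idea of a per-tick CPU partition can be pictured with a toy model. The sketch below is not PIKA's microkernel and does not reflect its internals; it only shows the bookkeeping of reserving a fixed share of each tick for media processing and leaving the remainder to everything else. The tick length, share, and function name are hypothetical.

```python
def split_tick(tick_ms: float, media_demand_ms: float, media_share: float) -> tuple[float, float]:
    """Return (media_ms, other_ms) for a single tick.

    Toy model: media processing runs first but is capped at its reserved
    share of the tick; whatever time remains goes to other processes.
    """
    media_cap_ms = tick_ms * media_share
    media_ms = min(media_demand_ms, media_cap_ms)
    other_ms = tick_ms - media_ms
    return media_ms, other_ms


# Hypothetical 10 ms tick with 25% reserved for media processing.
print(split_tick(10.0, 2.0, 0.25))  # media needs less than its reservation
print(split_tick(10.0, 4.0, 0.25))  # media demand is capped at the reservation
```

In this model, other processes always receive at least the unreserved part of every tick, which is the property the partitioning is meant to provide.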
To test PIKA’s microkernel and host-based media processing, a number of formal and informal tests were executed in conjunction with media processing activity. No combination of commercial applications or CPU load had any effect on the quality of the processed audio.
PIKA performed the following formal tests, on both Windows and Linux operating systems, to verify the robustness of its microkernel:
- Test the interference from and with user-level processes (normal applications).
- Test the interference from and with kernel-level processes (device drivers, such as NIC drivers, and disk drivers).
- Test competition for PCI bus resources.
To test interference between media processing and user-level processes:
- A media processing application was set up to provide the continuous playing of a recorded message to a large conference with 120 members.
- A phone was used to connect to the conference to monitor the audio quality.
- A base-line measure of the CPU utilization was taken.
- A normal-priority user application was executed that consumed a continually increasing percentage of CPU capacity.
- As the user application consumed more and more CPU capacity, the quality of the audio recording received from the conference was monitored.
Results: There was no change to the quality of the audio, even when the user application saturated the CPU utilization at 100%.
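A load generator of the kind used in this test could look roughly like the following sketch. It is not the tool PIKA used; it is a minimal duty-cycle busy loop with illustrative names and step sizes. Note that a single Python process loads only one core, so on a multi-core platform one instance per core would be needed to approach 100% overall utilization.

```python
import time


def ramp_levels(start: int = 10, stop: int = 100, step: int = 10) -> list[int]:
    """Target CPU load percentages, increasing from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def busy_idle_split(load_percent: int, period_s: float) -> tuple[float, float]:
    """Split one period into (busy seconds, idle seconds) for the target load."""
    busy_s = period_s * load_percent / 100.0
    return busy_s, period_s - busy_s


def burn(load_percent: int, duration_s: float, period_s: float = 0.1) -> None:
    """Hold roughly load_percent CPU on one core for about duration_s seconds."""
    busy_s, idle_s = busy_idle_split(load_percent, period_s)
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        spin_until = time.perf_counter() + busy_s
        while time.perf_counter() < spin_until:
            pass  # busy-wait to consume CPU
        time.sleep(max(idle_s, 0.0))  # yield the CPU for the rest of the period


# Example ramp (commented out because it runs for about 50 seconds):
# for level in ramp_levels():
#     burn(level, duration_s=5.0)
```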
To test interference between media processing and kernel-level processes, a similar test was performed, only this time the test process consumed CPU capacity at the kernel level.
Results: The results were identical to the previous test with no change to the quality of the audio, even when the kernel application saturated the CPU utilization at 100%.
To test competition for PCI bus resources, a Vmetro board was installed on the test platform. This board flooded the PCI bus with data, causing congestion on the PCI bus.
Results: Again, there was no change to the quality of the audio being received from the conference.
Finally, informal tests were performed that more accurately simulated real application processing. The following tasks were performed on the test platform while a conferencing application was executing:
- Search for the word “cow” in all files on the hard drive. This is a disk I/O and CPU intensive application that generates a large number of interrupts.
- Copy a large file to a network drive. This function causes a large number of interrupts and network traffic.
- Play an MP3 streaming audio file. This function causes interrupts from the audio card. The quality of the audio heard is very sensitive to the application being starved for CPU capacity.
- Compile source code. Compiling code is a CPU intensive activity.
With these functions running, the recording played to the conference was monitored.
Results: As with the other tests, there was no change to the quality of the audio being received from the conference. There was also no deterioration in the quality of the streaming audio file being played.
We can see from the above testing that PIKA’s microkernel has succeeded in partitioning the CPU and in isolating the media processing and other applications running on the same platform.
Measuring Latency
In simple terms, latency is the length of time from when you say something until the person on the other end of the phone line hears what you said. The perceived quality of the audio heard in a call is highly dependent on latency. Typically, the latency of PSTN switching (including PIKA’s AllOnHost TDM switching) and PSTN networks is very low, on the order of 5 ms, so developers of pure PSTN applications are rarely concerned with latency. On the other hand, elements of a VoIP network (phones, switches, and gateways) generally add significant latency to the audio path; therefore, developers of VoIP applications must be aware of the total latency of the audio path to ensure that it does not exceed acceptable limits.
How much latency is acceptable? Two classes of applications should be considered when determining the acceptable amount of latency: terminating applications, such as IVRs, and switching applications, such as PBXs, conference bridges, and gateways.
Terminating applications typically imply human interaction with a computer. The application records audio data, plays announcements, and detects DTMF tones or speech generated by the caller. Studies have shown that as long as the application responds within 500 ms, the caller will perceive a good quality connection.
Switching applications typically imply human-to-human interaction. ITU-T Recommendation G.114 defines three audio quality regions for latency in human-to-human calls.
Latency Range (ms) | Audio Quality |
0 to 150 ms | Acceptable for most applications |
150 to 400 ms | Marginally acceptable – impacts the quality of application |
Above 400 ms | Unacceptable |
Figure 5: G.114 Latency Guidelines for Switching Applications
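The three regions in Figure 5 map naturally onto a small lookup function. The sketch below is illustrative only; in particular, treating exactly 150 ms and exactly 400 ms as belonging to the lower region is an assumption, since the table lists both boundaries in two adjacent rows.

```python
def g114_region(latency_ms: float) -> str:
    """Classify one-way latency using the regions listed in Figure 5.

    ASSUMPTION: boundary values (150 ms and 400 ms) fall in the lower region.
    """
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms <= 150:
        return "acceptable"
    if latency_ms <= 400:
        return "marginally acceptable"
    return "unacceptable"


for sample_ms in (60, 150, 151, 400, 450):
    print(sample_ms, g114_region(sample_ms))
```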
PIKA measured the latency of several connection types using its AllOnHost (host-based) media processing. The results of these measurements are shown in Figure 6. For comparison, the latency measured between two good quality IP phones is also listed.
Equipment Configuration | Measured Latency Range (ms) |
TDM switching using AllOnHost – Analog phone to analog phone | <5 |
Good IP phone directly to good IP phone | 50 to 60 |
IP Gateway using AllOnHost – Analog phone to IP phone | 50 to 60 |
Mixed conferencing using AllOnHost – Analog phone to IP phone | 50 to 60 |
IP conferencing using AllOnHost – IP phone to IP phone | 105 to 120 |
IP transcoding using AllOnHost – G.711 IP phone to G.729 IP phone | 105 to 120 |
Figure 6: AllOnHost Latency Measurement
These values are valid for both G.711 and G.729 codecs. All tests used 20 ms packets and good quality IP phones. The measurements were performed using a switched LAN and locally-connected IP and analog phones.
Note that the latency of the AllOnHost IP gateway is identical to the latency measured between two good quality IP phones and that the latency added by the AllOnHost IP gateway is the same as the latency of a hardware-based IP gateway.
For some applications, the network latency must also be considered when determining the overall latency the callers experience. Figure 7 lists the range of latencies that can be expected, as well as the typical latency for a number of network distances. For comparison, values for PSTN latency for different distances are also given.
PSTN Network | Expected Latency Range (ms) | Typical Latency (ms) |
Local | 0.5 to 4 | 2 |
National long distance | 2 to 70 | 12 |
International long distance (excluding satellite) | 2 to 150 | 20 |
IP Network | | |
Switched LAN | 0.1 to 2 | <1 |
Metropolitan WAN | 2 to 50 | 20 |
National WAN | 2 to 150 | 50 |
Figure 7: Network Connection Latency
To determine the expected latency of an application, take the equipment latency measured for that type of application from Figure 6 and add the network latency, from Figure 7, that the audio signal will encounter as it passes from the speaker to the listener. To demonstrate this, consider the following examples: an IP PBX and a conference server.
IP PBX – In this application, shown in Figure 8, the IP PBX provides connectivity between the PSTN network and local and remote IP phones.
When determining the latency for this application, the latency for three different types of connection must be considered:
• Between PSTN and local IP phones (such as A and B)
• Between PSTN and remote IP phones (such as A and C)
• Between local and remote IP phones (such as B and C)
Figure 9 lists the connection type, the connection latency, the networks traversed, the network latency, and the total latency for each type of connection.
The latency for each type of connection is within the 150 ms limit although the connections to the remote IP phone are at the upper end of the acceptable range. Any additional latency introduced by the host-based media processing would cause the perceived quality of the connection to deteriorate.
Figure 8: IP PBX Application Architecture
Between Phones | Equipment Configuration | Equipment Latency (ms) | Networks Traversed | Network Latency (ms) | Total Latency (ms) |
PSTN and Local IP | IP Gateway | 60 | PSTN-LAN | 2+1 | 63 |
PSTN and Remote IP | IP Gateway | 60 | PSTN-LAN-WAN-LAN | 2+1+50+1 | 114 |
Local and Remote IP | IP Phone to IP Phone | 60 | LAN-WAN-LAN | 1+50+1 | 112 |
Figure 9: IP PBX Latency Estimates
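The totals in Figure 9 follow from adding the equipment latency to the latency of each network hop. A minimal sketch of that calculation is shown below. The per-hop values are the typical figures used in Figure 9 (local PSTN 2 ms, switched LAN rounded up to 1 ms, national WAN 50 ms); the names are illustrative.

```python
# Typical per-hop network latency in ms, as used in Figure 9.
NETWORK_MS = {"PSTN": 2, "LAN": 1, "WAN": 50}

G114_LIMIT_MS = 150  # upper bound of the "acceptable" region in Figure 5


def total_latency_ms(equipment_ms: int, path: list[str]) -> int:
    """Equipment latency plus the latency of every network hop on the path."""
    return equipment_ms + sum(NETWORK_MS[hop] for hop in path)


connections = {
    "PSTN and Local IP": (60, ["PSTN", "LAN"]),
    "PSTN and Remote IP": (60, ["PSTN", "LAN", "WAN", "LAN"]),
    "Local and Remote IP": (60, ["LAN", "WAN", "LAN"]),
}

for name, (equipment_ms, path) in connections.items():
    total = total_latency_ms(equipment_ms, path)
    headroom = G114_LIMIT_MS - total
    print(f"{name}: {total} ms total, {headroom} ms below the 150 ms limit")
```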
Conference Server – In this application, shown in Figure 10, the conference server provides connectivity between phones from the PSTN network and local IP phones.
When determining the latency for this application, the latency for three different types of connections must be considered:
• Between PSTN phones (such as A and B)
• Between PSTN and IP phones (such as A and C)
• Between IP phones (such as C and D)
Figure 11 lists the connection type, the connection latency, the networks traversed, the network latency, and the total latency for each type of connection.
The latency for each type of connection is within the 150 ms limit although the connection between IP phones is at the upper end of the acceptable range. Any additional latency introduced by the host-based media processing would cause the perceived quality of the connection to deteriorate.
Figure 10: Conference Server Application Architecture
Between Phones | Equipment Configuration | Equipment Latency (ms) | Networks Traversed | Network Latency (ms) | Total Latency (ms) |
PSTN | PSTN conferencing | 4 | PSTN-PSTN | 2+2 | 8 |
PSTN and IP | Mixed conferencing | 60 | LAN-LAN | 1+1 | 62 |
IP | IP conferencing | 120 | LAN-LAN | 1+1 | 122 |
Figure 11: Conference Server Latency Estimate
Summary: Overcoming the Challenges
There are significant challenges to implementing a host-based media processing solution. Each of these challenges must be overcome to produce a viable software-only telephony application. PIKA’s microkernel solution ensures that:
- Applications based on PIKA’s AllOnHost (host-based) media processing can accommodate up to 600 active channels. This is sufficient capacity for cost-effective small to medium-sized applications.
- There is no interference between the AllOnHost media processing and other applications executing on the platform. The media processing is guaranteed to receive sufficient CPU capacity to perform the required functions in real time.
- The latency added by the host-based media processing is small enough that applications can achieve latency below the 150 ms required for good quality human-to-human conversation.