Codecs are used to compress video to reduce the bandwidth required to transport streams, or to reduce the storage space required to archive them. The price for this compression is increased computational requirements: The higher the compression ratio, the more computational power is required.
Fixing the tradeoff between bandwidth and computational requirements has the effect of defining both the minimum channel bandwidth required to carry the encoded stream and the minimum specification of the decoding device. In traditional video systems such as broadcast television, the minimum specification of a decoder (in this case a set-top box) is readily defined.
Today, however, video is used in increasingly diverse applications with a correspondingly diverse set of client devices—from computers viewing Internet video to portable digital assistants (PDAs) and even the humble cell phone. The video streams for these devices are necessarily different.
To be made more compatible with a specific viewing device and channel bandwidth, the video stream must be encoded many times with different settings. Each combination of settings must yield a stream that targets the bandwidth of the channel carrying the stream to the consumer as well as the decode capability of the viewing device. If the original uncompressed stream is unavailable, the encoded stream must first be decoded and then re-encoded with the new settings. This quickly becomes prohibitively expensive.
In an ideal scenario, the video would be encoded only once with a high efficiency codec. The resulting stream would, when decoded, yield the full resolution video. Furthermore, in this ideal scenario, if a lower resolution or bandwidth stream was needed to reach further into the network to target a lower performance device, a small portion of the encoded stream would be sent without any additional processing. This smaller stream would be easier to decode and yield lower resolution video. In this way, the encoded video stream could adapt itself to both the bandwidth of the channel it was required to travel through and to the capabilities of the target device. These are exactly the qualities of a scalable video codec.
H.264 SVC
The Scalable Video Codec extension to the H.264 standard (H.264 SVC) is designed to deliver the benefits described in the preceding ideal scenario. It is based on the H.264 Advanced Video Codec standard (H.264 AVC) and heavily leverages the tools and concepts of the original codec. The encoded stream it generates, however, is scalable: temporally, spatially, and in terms of video quality. That is, it can yield decoded video at different frame rates, resolutions, or quality levels.
The SVC extension introduces a notion not present in the original H.264 AVC codec−that of layers within the encoded stream. A base layer encodes the lowest temporal, spatial, and quality representation of the video stream. Enhancement layers encode additional information that, using the base layer as a starting point, can be used to reconstruct higher quality, resolution, or temporal versions of the video during the decode process. By decoding the base layer and only the subsequent enhancement layers required, a decoder can produce a video stream with certain desired characteristics. Figure 1 shows the layered structure of an H.264 SVC stream. During the encode process, care is taken to encode a particular layer using reference only to lower level layers. In this way, the encoded stream can be truncated at any arbitrary point and still remain a valid, decodable stream.
(Click to enlarge)
Figure 1. The H.264 SVC Layered Structure.
This layered approach allows the generation of an encoded stream that can be truncated to limit the bandwidth consumed or the decode computational requirements. The truncation process consists simply of extracting the required layers from the encoded video stream with no additional processing on the stream itself. The process can even be performed "in the network". That is, as the video stream transitions from a high bandwidth to a lower bandwidth network (for example, from an Ethernet network to a handheld through a WiFi link), it could be parsed to size the stream for the available bandwidth. In the above example, the stream could be sized for the bandwidth of the wireless link and the decode capabilities of the handheld decoder. Figure 2 shows such an example as a PC forwards a low bandwidth instance of a stream to a mobile device.
Figure 2. Parsing Levels to Reduce Bandwidth and Resolution.
H.264 SVC Under the Hood
To achieve temporal scalability, H.264 SVC links its reference and predicted frames somewhat differently than conventional H.264 AVC encoders. Instead of the traditional Intra-frame (I frame), Bidirectional (B frame) and Predicted (P frame) relationship shown in Figure 3, SVC uses a hierarchical prediction structure.
(Click to enlarge)
Figure 3. Traditional I, P, and B Frame Relationship.
The hierarchical structure defines the temporal layering of the final stream. Figure 4 illustrates a potential hierarchical structure. In this particular example, frames are only predicted from frames that occur earlier in time. This ensures that the structure exhibits not only temporal scalability but also low latency.
(Click to enlarge)
Figure 4. Hierarchical Predicted Frames in SVC.
This scheme has four nested temporal layers: T0 (the base layer), T1, T2, and T3. Frames constituting the T1 and T2 layers are only predicted from frames in the T0 layer. Frames in the T3 layer are only predicted from frames in the T1 or T2 layers.
To play the encoded stream at 3.75 frames per second (fps), only the frames that constitute T0 need be decoded. All other frames can be discarded. To play the stream at 7.5 fps, the layers making up T0 and T1 are decoded. Frames in layers T2 and T3 can be discarded. Similarly, if frames that constitute T0, T1 and T2 are decoded, the resulting stream will play at 15 fps. If all frames are decoded, the full 30 fps stream is recovered.
By contrast, in H.264 AVC (for Baseline Profile where only unidirectional predicted frames are used), all the frames would need to be decoded irrespective of the desired display rate. To transit to a low bandwidth network, the entire stream would need to be decoded, the unwanted frames discarded, and then re-encoded.
Spatial scalability in H.264 SVC follows a similar principle. In this case, lower resolution frames are encoded as the base layer. Decoded and up-sampled base layer frames are used in the prediction of higher-order layers. Additional information required to reconstitute the detail of the original scene is encoded as a self-contained enhancement layer. In some cases, reusing motion information can further increase encoding efficiency.
Simulcast vs. SVC
There is an overhead associated with the scalability inherent in H.264 SVC. As can be seen in Figure 3, the distance between reference and predicted frames can be longer in time (from T0 to T1 for example) than with the conventional frame structure. In scenes with high motion, this can lead to slightly less efficient compression. There is also an overhead associated with the management of the layered structure in the stream.
Overall, an SVC stream containing three layers of temporal scalability and three layers of spatial scalability might be twenty percent larger than an equivalent H.264 AVC stream of full resolution and full frame rate video with no scalability. If scalability is to be emulated with the H.264 AVC codec, multiple encode streams are required, resulting in a dramatically higher bandwidth requirement or expensive decoding and re-encoding throughout the network.
Additional SVC Benefits
Error Resilience:
Error resilience is traditionally achieved by adding additional information to the stream so errors can be detected and corrected. SVC's layered approach means that a higher level of error detection and correction can be performed on the smaller base layer without adding significant overhead. Applying the same degree of error detection and correction to an AVC stream would require that the entire stream be protected, resulting in a much larger stream. If errors are detected in the SVC stream, the resolution and frame rate can be progressively degraded until, if needed, only the highly protected base layer is used. In this way, degradation under noisy conditions is much more graceful than with H.264 AVC.
Storage Management:
Since an SVC stream or file remains decodable even when truncated, SVC can be employed both during transmission and after the file is stored. Parsing files stored to disk and removing enhancement layers allows file size to be reduced without additional processing on the video stream stored within the file. This would not be possible with an AVC file which necessitates an "all or nothing" approach to disk management.
Content Management:
The SVC stream or file inherently contains lower resolution and frame rate streams. These streams can be used to accelerate the application of video analytics or for cataloging algorithms. The temporal scalability also makes the stream easier to search in fast forward or reverse.