Monday 31 December 2012

Video Compression Basics

In the old days, video transmission and storage were in the analog domain. Popular analog transmission standards were NTSC, PAL and SECAM, and video tapes built on the VHS and Betamax standards served as storage media. Later, video transmission and storage moved to the digital domain. Digital signals are immune to noise and require less power to transmit than analog signals, but they need more bandwidth. In communication engineering, power and bandwidth are scarce commodities. Compression is employed to reduce the bandwidth requirement by removing the redundancy present in digital signals; from a mathematical point of view, it decorrelates the data. The following case study highlights the need for compression. Digitized NTSC video requires a data rate of about 165 Mbits/sec, so a 90 minute uncompressed NTSC video generates around 110 gigabytes [1]. Around 23 DVDs would be required to hold this huge amount of data, yet one comes across DVDs that contain four 90 minute movies. This is possible only because of efficient video compression techniques.
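The figures quoted above can be checked with simple arithmetic. The sketch below assumes the 165 Mbit/s rate from the case study and a 4.7 GB single-layer DVD:

```python
# Back-of-the-envelope check of the case study above.
# Assumed values: 165 Mbit/s uncompressed digitized NTSC,
# 4.7 GB capacity for a single-layer DVD.
bit_rate = 165e6              # bits per second
duration = 90 * 60            # 90 minutes, in seconds
total_bytes = bit_rate * duration / 8
total_gb = total_bytes / 1e9  # decimal gigabytes
dvd_capacity_gb = 4.7
dvds_needed = total_gb / dvd_capacity_gb
print(f"{total_gb:.0f} GB uncompressed, about {dvds_needed:.0f} DVDs")
```

The result (roughly 111 GB, needing over 23 DVDs) matches the numbers in the text to within rounding.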

Television (TV) signals are a combination of video, audio and synchronization signals. When the general public refers to video, they usually mean TV signals; in the technical literature, TV signals and video are different things. If 30 still images (each slightly different from the next) are shown within a second, they create an illusion of motion in the eyes of the observer. This phenomenon is called 'persistence of vision'. In video technology a still image is called a frame. Eight frames per second are sufficient to create an illusion of motion, but about 24 frames per second are required for smooth motion as in movies.

Figure 1: Two adjacent frames (top); temporal-redundancy-removed image (bottom)

Compression can be classified into two broad categories: transform coding and statistical (entropy) coding. In transform coding, the Discrete Cosine Transform (DCT) and wavelet transforms are extensively used for image and video compression. In statistical coding, Huffman coding and arithmetic coding are extensively used. First, transform coding is applied to the digital video signal; then statistical coding is applied to the coefficients of the transform-coded signal. This strategy is common to both image and video signals. For further details read [2].
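To see why the transform step helps, consider the 1-D DCT-II (JPEG and MPEG use its 2-D form on 8x8 blocks). The naive pure-Python sketch below applies it to a smooth row of pixel values; the energy piles up in the first few (low-frequency) coefficients, and the near-zero rest are cheap to entropy-code:

```python
import math

def dct_ii(x):
    """Naive, unnormalized 1-D DCT-II of a list of samples."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]

# A smooth 8-sample row of pixel values, typical of natural images.
row = [100, 102, 104, 106, 108, 110, 112, 114]
coeffs = dct_ii(row)
# coeffs[0] carries the average (DC) energy; the higher-frequency
# coefficients shrink rapidly toward zero -- this is the
# "decorrelation" the article mentions.
```

This is only a demonstration of energy compaction, not production DCT code (real codecs use fast, normalized 2-D implementations).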

Video compression employs both intra-frame coding and inter-frame coding. Intra-frame coding is similar to JPEG coding, while inter-frame coding exploits the redundancy present among adjacent frames. Five to fifteen frames form a Group of Pictures (GOP). In Figure 2 the GOP size is seven and it contains one Intra (I) frame, two Predicted (P) frames and four Bi-directionally predicted (B) frames. In an I frame, spatial redundancy alone is exploited, very similarly to JPEG compression. In P and B frames, both spatial and temporal (time) redundancy are removed; the temporal-redundancy-removed image can be seen in Figure 1. In Figure 2, P frames are present in the 4th and 7th positions. The P1 frame in the 4th position contains the difference between the I frame and the 4th frame; only this difference, the prediction error, is coded. To regenerate the 4th frame, both the I frame and the P1 frame are required. Likewise, the 2nd frame (B1) is coded as the prediction error against the I and P1 frames on either side of it. Since a B frame needs both of its reference frames, the decoding order differs from the display order: the decoder processes I, P1, B1, B2, P2, B3, B4.
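The dependency structure of this 7-frame GOP can be written down explicitly. The sketch below assumes the usual reference pattern (each B frame predicted from the nearest I/P frame on either side, each P from the previous reference); the frame names are just labels for this example:

```python
# Display order of the 7-frame GOP described in the text:
display_order = ["I", "B1", "B2", "P1", "B3", "B4", "P2"]

# A B frame cannot be decoded until both of its references exist,
# so the decoder must receive reference frames first:
decode_order = ["I", "P1", "B1", "B2", "P2", "B3", "B4"]

# Assumed reference pattern for each predicted frame:
refs = {
    "P1": ["I"],            # P1 holds the prediction error against I
    "P2": ["P1"],           # P2 is predicted from the previous reference
    "B1": ["I", "P1"],      # each B uses the references on either side
    "B2": ["I", "P1"],
    "B3": ["P1", "P2"],
    "B4": ["P1", "P2"],
}

# Sanity check: every frame's references appear earlier in decode order.
pos = {frame: i for i, frame in enumerate(decode_order)}
assert all(pos[r] < pos[f] for f, rlist in refs.items() for r in rlist)
```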

Figure 2 Group of Pictures (GOP)

One may wonder why a GOP is limited to about 15 frames. We know that more P and B frames result in more efficient compression. The flip side is that if there is an error in the I frame, the dependent P and B frames cannot be decoded properly. The result is a partially decoded still image (i.e. the I frame) shown to the viewer for the entire duration of the GOP; with 15 frames, one may see a still image for half a second, and beyond that duration the viewer is annoyed by the frozen picture. A larger GOP also increases decoding time, which adds to the latency budget, and real-time systems require very low latency.
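The half-second figure follows directly from the frame rate. A quick check, assuming 30 frames per second as in NTSC:

```python
# How long a corrupted I frame can leave the picture frozen:
# the whole GOP depends on it, so freeze time = GOP size / frame rate.
fps = 30
for gop_size in (7, 15, 30):
    freeze = gop_size / fps
    print(f"GOP of {gop_size:2d} frames -> up to {freeze:.2f} s frozen")
```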

In a typical soap-opera TV episode, very few scene changes occur within a fixed duration. Take two adjacent frames: objects (a face, a car, etc.) in the first frame will have moved slightly in the second frame. If we know the direction and amount of motion, we can move the first frame's objects accordingly to recreate the second frame. The idea is simple to comprehend, but the implementation is very taxing. Each frame is divided into a number of macroblocks, each containing 16x16 pixels (in JPEG, 8x8 pixels are called a block, which is why 16x16 pixels are called a macroblock). Take the macroblocks one by one in the current frame (in our example, the 2nd frame in Figure 1) and find the 'best matching' macroblock in the reference frame (the first frame in Figure 1). Predicting the current macroblock from that best match is called motion compensation; what gets coded is the difference between the two blocks, the prediction error. The positional difference between the two blocks is represented by a motion vector, and the process of searching for the best matching macroblock is called motion estimation [3].
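A minimal full-search motion estimation can be sketched in pure Python. To keep the example short it assumes 4x4 blocks and a tiny search window (real codecs use 16x16 macroblocks, much larger windows, and fast search strategies); frames are plain 2-D lists of grayscale values:

```python
def sad(ref, cur, ry, rx, cy, cx, bs):
    """Sum of absolute differences between a bs x bs block at (ry, rx)
    in the reference frame and one at (cy, cx) in the current frame."""
    return sum(abs(ref[ry + i][rx + j] - cur[cy + i][cx + j])
               for i in range(bs) for j in range(bs))

def motion_estimate(ref, cur, cy, cx, bs=4, search=2):
    """Full search: test every offset within +/-search pixels and
    return the motion vector (dy, dx) with the lowest SAD cost."""
    best_mv, best_cost = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry and ry + bs <= len(ref) and \
               0 <= rx and rx + bs <= len(ref[0]):
                cost = sad(ref, cur, ry, rx, cy, cx, bs)
                if cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Toy frames: an 8x8 gradient that shifts one pixel to the right
# between the reference frame and the current frame.
ref = [[10 * x + y for x in range(8)] for y in range(8)]
cur = [[10 * (x - 1) + y for x in range(8)] for y in range(8)]
mv, cost = motion_estimate(ref, cur, cy=2, cx=2)
print(mv, cost)   # best match one pixel to the left in the reference
```

The returned vector (0, -1) with zero cost says the block's content in the reference frame sits one pixel to the left of its position in the current frame, i.e. the scene moved right; the residual to code is zero.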


Figure 3 Motion Vector and Macroblocks

       
         A closer look at the first and second frames in Figure 1 offers the following inferences: (1) there is a slight colour difference between the first and second frames; (2) the pixel located at (3,3) in the first frame is the (0,0)th pixel in the second frame.
         In Figure 3 a small portion of a frame is taken and its macroblocks are shown: there are 16 macroblocks, in four rows and four columns.

A group of macroblocks combined together forms a Slice.

Further Information:  
  • Display systems like TVs and computer monitors incorporate the additive colour mixing concept; the primary colours are Red, Green and Blue. Printing uses the subtractive colour mixing concept, with the primary colours Cyan, Magenta, Yellow and Black (CMYK).
  • The human eye is more sensitive to brightness variation than to colour variation. To exploit this feature the YCbCr model is used: Y -> Luminance, Cb -> Chrominance Blue, Cr -> Chrominance Red. Please note Chrominance Red ≠ Red.
  • To conserve bandwidth, analog TV systems use Vestigial Sideband Modulation, a variant of Amplitude Modulation (AM), and incorporate the interlaced scanning method.
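The RGB-to-YCbCr conversion mentioned above is a simple linear transform. This sketch uses the full-range BT.601 coefficients (the variant used by JPEG); note that a neutral gray has Cb = Cr = 128, which is why "Chrominance Red" is not the same thing as red:

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr (JPEG-style), inputs 0..255.
    Results may need clamping to 0..255 in a real codec."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

# Neutral gray: luminance only, both chrominance channels sit at 128.
print(rgb_to_ycbcr(128, 128, 128))   # (128.0, 128.0, 128.0)
```

Because the eye tolerates coarse colour, codecs then subsample Cb and Cr (e.g. 4:2:0) while keeping Y at full resolution.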

Note: This article is written to give the reader some idea about video compression within a short span of time. The article has been written carefully, but its accuracy cannot be guaranteed; so please read books and understand the concepts properly.
Sources:
[2] Salent-Compression-Report.pdf, http://www.salent.co.uk/downloads/Salent-Compression-Report.pdf (PDF, 1921 KB)
[3] Iain E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia," John Wiley & Sons Ltd, 2003. ISBN 0-470-84837-5. (Examples are very good.)