July 14, 2020

Under the Hood of Video Apps That Bring Home Meetings, Movies and More

Srini Rajam (Chairman & CEO, Ittiam)

In the new normal we are in, video is ubiquitous in our homes, from the office desk and living room to the game room, all within a few feet of each other. Virtually every call we make or receive can be turned into a video call. Every program on TV can be viewed later on a different screen, and every training program is available as an online video session. Suddenly, and refreshingly, it’s a brand new world where numerous video apps surround us with the intent to assist, entertain, and enlighten. How do they look under the hood: similar or different? What are the technologies that enable smooth interaction? What more can we expect in the future in this exciting space? These questions are intriguing!

Every Video Application Presents Unique Technology Demands

From a video technology viewpoint, one size doesn’t fit all applications. Each app has its own purpose and requires the enabling technologies to be lined up in a particular fashion to maximize the end user experience for that app. The differing demands of applications can be understood from the two illustrations below.

Interactivity: When a user is not merely watching a video being streamed in but is also interacting with others through the video, such as in a video meeting organized over Microsoft Teams or Zoom, the app must be very agile to facilitate a natural dialog without noticeable lag or “latency”. The video and systems technology needed to meet this requirement has a distinct flavor, with its own associated methods.

Complexity: The challenges in video technology depend on key factors such as resolution, frame rate, content security (rights management) and content diversity. High definition videos require more processing power, transmission bandwidth and storage. Likewise, videos involving fast action, such as sports or a Twitch game stream with a rich mix of video, audio, text, and graphics, require highly sophisticated technology support.

An overview of technology requirements of video applications and the solution stack is shown in Fig. 1. Many of the key elements therein are explained in this article as you read on.

The importance of audio technology in a video application cannot be emphasized enough due to its huge impact on the user experience. As we know, when a video is finally delivered, audio becomes integral to it. This subject, however, would require a dedicated focus in order to do full justice to its scope. Therefore, I have refrained from explicitly covering it in this article.

Video Compression, the Engine for a Video Application

If a video app is the car, video compression is its engine. It is not only essential but also greatly influences the eventual functionality of the app. Video compression techniques reduce the size of the original video by orders of magnitude, roughly 500-1000 times, while largely retaining the video quality. Without such compression, it would not be feasible to store a full-length movie on a DVD or to transmit a video clip over a cellular network to our smartphones.
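To put the 500-1000 times figure in perspective, the arithmetic can be sketched as follows. This is only an illustration: the 1080p, 30 frames per second, 8-bit 4:2:0 source format (12 bits per pixel) and the 1 Mbps target stream are assumptions chosen for the example, not figures from this article.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=12):
    """Uncompressed bitrate in bits per second.

    12 bits per pixel corresponds to 8-bit 4:2:0 video (assumed here).
    """
    return width * height * fps * bits_per_pixel


def compression_ratio(raw_bps, compressed_bps):
    """How many times smaller the compressed stream is than the raw one."""
    return raw_bps / compressed_bps


raw = raw_bitrate_bps(1920, 1080, 30)   # assumed source: 1080p at 30 fps
target = 1_000_000                      # assumed compressed stream: 1 Mbps
ratio = compression_ratio(raw, target)
print(f"raw: {raw / 1e6:.0f} Mbps, compression ratio: {ratio:.0f}x")
```

Under these assumptions the uncompressed video needs several hundred megabits per second, which is why a ratio in the hundreds is needed before the stream fits a typical home or cellular connection.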

The Moving Picture Experts Group (MPEG) standards committee, working under the aegis of ISO, has contributed several popular standards. Its second-generation standard, MPEG-2, was used in DVDs and in early satellite digital video broadcasts. The most widely deployed MPEG standard today is H.264 or AVC (Advanced Video Coding), which was initially approved in 2003. Since H.264, MPEG has released H.265 or HEVC (High Efficiency Video Coding), and it is on the verge of launching the next-generation H.266 or VVC (Versatile Video Coding) standard.

The Alliance for Open Media (AOM), founded in 2015 by industry-leading companies in online and internet media, has the charter to deliver advanced video compression standards that are royalty-free and open source. AOM’s AV1 standard, whose framework was derived from the VP9 standard developed by Google, has strong traction in leading online video services. The body is now working on the next-generation AV2 standard.

Video compression technologies have evolved over successive generations of standards, as seen above. Typically, a new-generation standard achieves 35-50% higher compression efficiency than its predecessor for the same level of video quality. For example, on a 2 Mbps broadband line to the home, a movie that could be streamed at 720p resolution when encoded with the H.264 standard can be streamed at 1080p resolution when encoded with H.265.
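The generational gain can be turned into a quick calculation. The sketch below reads the 35-50% efficiency gain as a 35-50% reduction in bitrate at equal quality, which is an interpretation rather than something stated above; the function name and the 2000 kbps starting point are chosen only for illustration.

```python
def bitrate_after_generations(bitrate_kbps, savings_per_generation, generations=1):
    """Bitrate needed for the same quality after moving to newer standards.

    savings_per_generation is the fractional bitrate reduction per step,
    e.g. 0.35 for 35% (an assumed reading of "compression efficiency").
    """
    return bitrate_kbps * (1 - savings_per_generation) ** generations


h264_kbps = 2000  # assumed bitrate of an H.264 stream filling a 2 Mbps line
for savings in (0.35, 0.50):
    next_gen = bitrate_after_generations(h264_kbps, savings)
    freed = h264_kbps - next_gen
    print(f"{savings:.0%} savings: {next_gen:.0f} kbps needed, {freed:.0f} kbps freed")
```

The freed bitrate is the headroom that a service can spend on a higher resolution, a higher frame rate, or simply a lower delivery cost.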

Video Codec – the Pair of Encoder and Decoder

A video compression scheme implemented as per any of the standards described above consists of two distinct units: an Encoder, which performs the compression, and a Decoder, which performs the decompression. The pair together is called a Codec. The encoder requires many times the computational power needed for the decoder, since it needs to incorporate several complex modules to ensure the target video quality at the lowest possible bitrate. The higher the target video quality, the greater the computational effort! On the other hand, a decoder needs to run on a very diverse set of clients, such as a phone, laptop, smart TV or gaming console, and perform its function while using as little of the client's processing resources as possible. Since many of the clients, such as phones, run on batteries, the power efficiency of a decoder has a direct impact on the time between recharges.
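The asymmetry between encoder and decoder can be illustrated with a deliberately tiny toy. The sketch below is not any standard's algorithm: it works on one-dimensional lists of samples instead of images, is lossless, and all names (`sad`, `encode`, `decode`) are hypothetical. It only shows the general pattern in which the encoder searches many candidate predictions per block while the decoder performs a single copy-and-add per block.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def encode(ref, cur, block=4, search=3):
    """Toy encoder: for each block, search the reference for the best match."""
    stream, comparisons = [], 0
    for start in range(0, len(cur), block):
        target = cur[start:start + block]
        best = None
        for shift in range(-search, search + 1):
            pos = start + shift
            if pos < 0 or pos + len(target) > len(ref):
                continue  # candidate falls outside the reference frame
            candidate = ref[pos:pos + len(target)]
            comparisons += 1
            cost = sad(target, candidate)
            if best is None or cost < best[0]:
                best = (cost, shift, candidate)
        _, shift, candidate = best
        residual = [t - c for t, c in zip(target, candidate)]
        stream.append((shift, residual))
    return stream, comparisons


def decode(ref, stream, block=4):
    """Toy decoder: copy the predicted block and add the residual."""
    out = []
    for i, (shift, residual) in enumerate(stream):
        pos = i * block + shift
        predicted = ref[pos:pos + len(residual)]
        out.extend(p + d for p, d in zip(predicted, residual))
    return out


ref = [10, 12, 15, 20, 26, 33, 41, 50, 60, 71, 83, 96]
cur = ref[2:] + [110, 125]           # same content shifted left by two samples
stream, comparisons = encode(ref, cur)
assert decode(ref, stream) == cur     # the toy round trip is lossless
print(f"encoder block comparisons: {comparisons}, decoder block copies: {len(stream)}")
```

Even in this toy, the encoder evaluates several candidates for every block that the decoder reconstructs in one step. Production encoders add many more decisions on top of this search, which is where the large gap in computational effort comes from.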

An Encoder and a Decoder are commonly implemented in software running on a range of Systems on Chip (SoCs), such as those offered by Intel, Nvidia and Qualcomm. Some SoCs also integrate a hardwired implementation of the codec, either in full or for significant blocks. Typically, the codecs of older-generation standards are available in hardwired form because they are mature and stable, while codecs of newer-generation standards are available in software form to accommodate the evolving nature of the technology.

Both MPEG and AOM have member companies from multiple domains, including semiconductors, software and systems. Ittiam participates in and contributes to both forums.

Enhancing the Video Codec Platform for Application Specific Needs

As noted earlier, every application presents its own unique technology demands, be it low latency or rich media. A Video Codec Platform, in addition to having the basic encoder-decoder pair, encompasses several useful functions that enable efficient and fast creation of an end application, as illustrated in Fig. 2. Companies like Ittiam, which focus on core video technology, specialize in providing a comprehensive platform.

Full Solution Stack for Building a Video Application

In addition to a high-performance and versatile core video codec platform, the following two major components help complete the full solution stack in building a video application. The stack is broadest at the bottom and becomes exclusively application focused at the top, as depicted in Fig. 1.

Cloud Infrastructure: This consists of reconfigurable frameworks supporting many aspects of the online application workflow. Extensive technologies, many of them open source, are available in this space, along with cloud platforms from players such as Google, Microsoft (Azure) and Amazon (AWS), to accelerate product development.

User Experience: As we know, the ultimate success of an app depends heavily on the end user experience. Application specific design and user centric technologies have advanced by leaps and bounds to make this space a very specialized and sophisticated one.

AI is at the Heart of Video Futures

Artificial Intelligence is the future of virtually every field, and video is no different. While numerous possibilities are being researched and developed, the following two prominent areas give a glimpse of the immediate future.

Deep Learning: With the help of neural networks and deep learning, the core design of a video codec can be improved enormously by making it adapt to the type of content being encoded (see reference to CAE in Fig. 1 and Fig. 2). In this approach, an encoder “understands” the nature of the video being processed and allocates only as many bits as that content requires, fewer for simple scenes and more for complex ones, to reach the desired quality. The THINKode product from Ittiam performs Content Adaptive Encoding using Deep Learning techniques.
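The idea of content-adaptive bit allocation can be sketched in a few lines. The example below is not how THINKode or any deep learning encoder works: it swaps the neural network for a hand-written heuristic (mean absolute change between two frames), and the function names, the frame data and the 500-4000 kbps range are all assumptions made for illustration.

```python
def complexity(prev_frame, frame):
    """Mean absolute change per sample between two frames (0 means static)."""
    return sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)


def allocate_bitrate(score, min_kbps=500, max_kbps=4000, max_score=50.0):
    """Map a complexity score to a bitrate between min_kbps and max_kbps.

    The linear mapping and its limits are illustrative assumptions.
    """
    fraction = min(score / max_score, 1.0)
    return min_kbps + fraction * (max_kbps - min_kbps)


# Hypothetical eight-sample "frames": one nearly static, one with large changes.
static_scene = ([100] * 8, [100, 100, 101, 100, 100, 100, 100, 100])
action_scene = ([100] * 8, [40, 180, 90, 200, 20, 150, 60, 170])

for name, (prev, cur) in (("static", static_scene), ("action", action_scene)):
    score = complexity(prev, cur)
    print(f"{name}: complexity {score:.1f}, bitrate {allocate_bitrate(score):.0f} kbps")
```

A learned model would replace the `complexity` heuristic with a prediction of how many bits a scene needs to reach a target quality, but the surrounding logic stays the same: spend few bits on simple content and more on demanding content.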

Computer Vision: At the application level, Computer Vision capabilities will greatly enhance the user experience, for example by automatically customizing the participants’ backgrounds in video conferencing. They can also provide cues for improved interaction and enable vision-based indexing, which helps in summarizing meeting sessions.

In conclusion, video experiences of today are set to become more pervasive with applications that span facets of our lives such as work, family, friend circles, education and entertainment. Diverse and challenging demands of these applications place core video platform companies like Ittiam at the center of innovation and value creation. Artificial Intelligence will play a dominant role in creating applications with use cases that we have not even imagined yet.