April 4, 2014

VP9 Video Encoder with Faster Turnaround

by Ranjit Kumar Tulabandu (Principal Engineer, Media Server Technologies)

libvpx is a software video codec library from Google which serves as the reference software implementation for the VP8 and VP9 video coding standards. libvpx is distributed as open source software under a revised BSD License.

VP9 video encoding algorithms, as implemented in libvpx, offer a BD rate improvement of up to 40% over H.264/AVC encoders for typical high quality presets in 2 pass encoding mode. This makes libvpx (and VP9) an attractive candidate for usage in applications such as Over The Top (OTT) video delivery services.

However, compared with H.264/AVC encoders, libvpx encodes more slowly, which can lengthen turnaround times. For example, with version 1.6.0 of libvpx, 2 pass encoding with the ‘good’ --cpu-used=1 preset is observed to be up to 2x slower than the ‘very slow’ preset of the x264 encoder on the same hardware with similar thread configurations. Despite the bandwidth gains on offer, this performance gap could be a barrier to adoption of VP9 technology.

We recently worked on a project, in partnership with Netflix and Google, to improve the performance of the libvpx encoder implementation. The overall results were announced in this press release; this blog provides more details about the efficient multi-threading implementation that enabled a 50-70% improvement in speed with no loss in quality.

Unlike assembly level optimizations, multi-threading optimizations are architecture agnostic and apply to any multi-core processor. As part of the improvements, multi-threading was applied to the following three blocks of libvpx, which performed poorly in 2 pass encoding mode:

1. First pass stats collection process

The first pass stats collection process in the libvpx encoder is single threaded. All Macro Blocks (MBs) are processed in raster scan order within a frame. This process can be multi-threaded by processing different MB rows of a tile in parallel, with top sync at the MB level to resolve the dependency on top pixels for intra prediction. Figure 1 demonstrates row based multi-threading (MT) with 2 tile columns and 4 threads. Threads 1 and 3 process tile column 0, and threads 2 and 4 process tile column 1.

Figure 1. Proposed MT approach with two tile columns and four threads

The processing follows the assignment above until the relevant tile columns are completed.

If there are no tile MB rows to be processed in the current tile column, threads are assigned to other tile columns as shown in Figure 2. The multi-threaded implementation uses a job queue mechanism where each job corresponds to processing of a tile MB row.

Figure 2. Reassignment of threads with two tile columns and four threads
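The job queue and top sync described above can be sketched roughly as follows. This is a minimal Python illustration, not the libvpx code: the function name, the `lag` parameter (how many MBs the row above must stay ahead) and the placeholder per-MB work are all assumptions made for the example.

```python
import queue
import threading


def process_rows_with_top_sync(num_tile_cols, rows_per_tile, mbs_per_row,
                               num_threads, lag=1):
    """Run one job per tile MB row and return MBs in completion order."""
    # progress[tile][row] = number of MBs already finished in that row.
    progress = [[0] * rows_per_tile for _ in range(num_tile_cols)]
    cond = threading.Condition()
    finished = []  # (tile, row, mb) tuples, appended as each MB completes

    # Queue the jobs row by row, alternating tile columns, so upper rows
    # are always handed out before the rows that depend on them.
    jobs = queue.Queue()
    for row in range(rows_per_tile):
        for tile in range(num_tile_cols):
            jobs.put((tile, row))

    def worker():
        while True:
            try:
                tile, row = jobs.get_nowait()
            except queue.Empty:
                return  # no tile MB rows left in any tile column
            for mb in range(mbs_per_row):
                with cond:
                    if row > 0:
                        # Top sync: the row above must be `lag` MBs ahead
                        # (assumed lag, clamped at the end of the row).
                        needed = min(mb + 1 + lag, mbs_per_row)
                        cond.wait_for(
                            lambda: progress[tile][row - 1] >= needed)
                    # Placeholder for the real per-MB stats collection,
                    # which would run outside the lock in practice.
                    progress[tile][row] = mb + 1
                    finished.append((tile, row, mb))
                    cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

With 2 tile columns and 4 threads, the first four jobs handed out are row 0 and row 1 of each tile column, which mirrors the assignment in Figure 1. A thread that finishes its row simply takes the next job in the queue, whichever tile column it belongs to, as in Figure 2.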

2. Second pass encoding stage

The parallelism in the second pass of libvpx VP9 encoder is limited by the following factors:

  • Number of tile columns configured for a given resolution. For example, for 1080p resolutions, the maximum number of tile columns possible is 4, limiting the encoder to a maximum of 4 way parallelism.
  • Wastage due to variable thread processing times, caused by non-uniform tile column sizes and content variation across the tile columns.
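The 1080p figure in the first point can be reproduced with a short calculation. The sketch below assumes the VP9 constraints that the number of tile columns is a power of two and that each tile column is at least four 64x64 superblocks (256 pixels) wide; the function and constant names are illustrative only.

```python
MIN_TILE_WIDTH_SB = 4  # assumed minimum tile width, in 64x64 superblocks


def max_tile_columns(frame_width):
    """Largest power-of-two tile column count allowed for this width."""
    sb_cols = (frame_width + 63) // 64  # frame width in superblocks
    log2_cols = 0
    # Keep doubling the column count while each column stays wide enough.
    while (sb_cols >> (log2_cols + 1)) >= MIN_TILE_WIDTH_SB:
        log2_cols += 1
    return 1 << log2_cols
```

For a 1920 pixel wide frame this gives 4 tile columns, so a scheme that runs one thread per tile column cannot use more than 4 threads at that resolution.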

The above limitations can be addressed by using a job queue mechanism, as in Figures 1 and 2, where each job corresponds to a tile Super Block (SB) row. Top sync is enforced as required for intra and motion vector (MV) prediction.
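To see why row-level jobs help with both limitations, a toy scheduling model is enough. The sketch below compares one thread per tile column against a shared queue of row jobs. It is a simplification under stated assumptions: the per-row costs are made-up numbers, and stalls caused by top sync are ignored.

```python
import heapq


def makespan_per_tile_column(tile_row_costs):
    """One thread per tile column: the slowest column sets the total time."""
    return max(sum(rows) for rows in tile_row_costs)


def makespan_row_job_queue(tile_row_costs, num_threads):
    """Shared queue of row jobs; each job goes to the earliest free thread."""
    num_rows = len(tile_row_costs[0])
    # Same job order as in the figures: row by row, alternating columns.
    jobs = [tile_row_costs[tile][row]
            for row in range(num_rows)
            for tile in range(len(tile_row_costs))]
    free_at = [0] * num_threads  # time at which each thread becomes free
    heapq.heapify(free_at)
    for cost in jobs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical content: tile column 0 is four times as costly as column 1.
costs = [[4, 4, 4, 4], [1, 1, 1, 1]]
```

With these costs, one thread per tile column takes 16 time units no matter how many threads are available. The row job queue takes 10 units with 2 threads and 6 units with 4 threads, which illustrates both the better load balance and the benefit of using more threads than tile columns.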

3. ARNR filtering

In the reference implementation, the filtering process is single threaded. All MBs in the frame are processed in raster scan order. This was multi-threaded using a job queue mechanism similar to the one above. As the filtering process does not have any spatial dependencies, top sync is not required.
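Because no row waits on another, the filtering case reduces to handing rows to a thread pool. The sketch below shows only that structure: `filter_row` is a stand-in that averages co-located pixels across frames, not the actual ARNR filter, and all names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_row(frames, row):
    """Stand-in temporal filter: average co-located pixels across frames."""
    count = len(frames)
    width = len(frames[0][row])
    return [sum(frame[row][x] for frame in frames) // count
            for x in range(width)]


def filter_frame_mt(frames, num_threads):
    """Filter every row as an independent job; no top sync is needed."""
    rows = range(len(frames[0]))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in row order, whichever thread ran them.
        return list(pool.map(lambda row: filter_row(frames, row), rows))
```

In CPython the global interpreter lock prevents a real speedup for this pure Python work, so the example is meant to show the job structure rather than the performance.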

The row based multi-threading approach discussed above keeps wastage due to variable thread processing times to a minimum. It also improves encoding performance when the number of threads is increased beyond the number of tile columns. The methodology has a negligible impact on BD rate. The changes above were made in libvpx and are available as part of this Git libvpx commit and communicated to the developer community here.

Table 1 captures the total encoding time reduction achieved in 2 pass encoding mode for different resolutions with row based multi-threading (keeping the computational resources identical).

 

Table 1. Encoding time reduction achieved with Threads=Max column tiles

Table 2 captures the total encoding time reduction achieved in 2 pass encoding mode for different resolutions with row based multi-threading (after doubling the computational resources).

 

Table 2. Encoding time reduction achieved with Threads=2 * Max column tiles

With an improvement of up to 60-70% in turnaround time, the optimized libvpx version significantly lowers the computational cost and turnaround time barriers to adoption of VP9. Combined with the bandwidth gains over H.264/AVC encoding, the optimized implementation provides an efficient and viable option for encoding HD and UHD/4K streams for online video streaming applications.

A more detailed analysis of the quality and performance of Ittiam’s optimized libvpx implementation was presented in a paper at the SMPTE Annual Technical Conference 2016. The paper can be accessed at the IEEE online portal, and a brief summary of the paper is available on this blog.

Reach out to us at mkt@www.ittiam.com for more insights.

Check out our solution page – VP9 Encoder