VP9 Video Encoder with Faster Turnaround

libvpx is a software video codec library from Google that serves as the reference software implementation for the VP8 and VP9 video coding standards. libvpx is distributed as open source software under a revised BSD License.
VP9 video encoding algorithms, as implemented in libvpx, offer a BD rate improvement of up to 40% over H.264/AVC encoders for typical high quality presets in 2 pass encoding mode. This makes libvpx (and VP9) an attractive candidate for applications such as over-the-top (OTT) video delivery services.
However, in comparison with H.264/AVC encoders, libvpx has lower encoding speeds that can result in longer turnaround times. For example, using version 1.6.0 of libvpx, the 2 pass encoding speed of the ‘good’ quality preset with --cpu-used=1 is observed to be up to 2x slower than the ‘veryslow’ preset of the x264 encoder on the same hardware with similar thread configurations. In spite of the bandwidth gains on offer, this gap in performance could create barriers to the adoption of VP9 technology.
We recently worked on a project to improve the performance of the libvpx encoder implementation in partnership with Netflix and Google. The overall message was communicated through this press release, and this blog post provides more details about the efficient multi-threading implementation that has enabled a 50-70% improvement in speed with no loss in quality.
Unlike assembly level optimizations, multi-threading optimizations are architecture agnostic and applicable to any multi-core processor. As part of the improvements, multi-threading was applied to the following three blocks of libvpx, which suffered from poor performance in 2 pass encoding mode.

1. First pass stats collection process

The first pass stats collection process in the libvpx encoder is single threaded: all macroblocks (MBs) in a frame are processed in raster scan order. This process can be multi-threaded by processing different tile MB rows in parallel, with top sync at the MB level to resolve the dependency on top pixels for intra prediction. Figure 1 demonstrates row based multi-threading (MT) with 2 tile columns and 4 threads. Threads 1 and 3 process tile column 0 and threads 2 and 4 process tile column 1.
Figure 1. Proposed MT approach with two tile columns and four threads

The processing follows the assignment above until the relevant tile columns are completed.

If there are no tile MB rows to be processed in the current tile column, threads are assigned to other tile columns as shown in Figure 2. The multi-threaded implementation uses a job queue mechanism where each job corresponds to processing of a tile MB row.
Figure 2. Reassignment of threads with two tile columns and four threads
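
The job queue and top sync described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the libvpx implementation: the sizes (`MB_ROWS`, `MBS_PER_ROW`), the function name `run_first_pass`, and the `progress` table are hypothetical, and the per-MB work is a placeholder. Each job is one tile MB row, and a thread processes MB `mb` of a row only after the MB directly above it in the same tile column has finished.

```python
import queue
import threading

TILE_COLS = 2      # as in Figure 1
NUM_THREADS = 4    # as in Figure 1
MB_ROWS = 4        # hypothetical number of MB rows per tile column
MBS_PER_ROW = 6    # hypothetical number of MBs in one tile MB row


def run_first_pass():
    """Process every tile MB row through a shared job queue with top sync."""
    # progress[tile][row] = number of MBs already finished in that tile MB row
    progress = [[0] * MB_ROWS for _ in range(TILE_COLS)]
    cond = threading.Condition()
    order = []  # (tile, row, mb) in completion order, kept for inspection

    # One job per tile MB row. Rows are queued top to bottom and interleaved
    # across tile columns, so with 4 threads the first jobs match Figure 1.
    jobs = queue.Queue()
    for row in range(MB_ROWS):
        for tile in range(TILE_COLS):
            jobs.put((tile, row))

    def worker():
        while True:
            try:
                tile, row = jobs.get_nowait()
            except queue.Empty:
                return  # no tile MB rows left in any tile column
            for mb in range(MBS_PER_ROW):
                if row > 0:
                    # Top sync: wait until the MB above has been processed.
                    with cond:
                        cond.wait_for(lambda: progress[tile][row - 1] > mb)
                # Placeholder for the real per-MB stats collection.
                with cond:
                    progress[tile][row] = mb + 1
                    order.append((tile, row, mb))
                    cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Because jobs leave the queue in top-to-bottom order, the row a thread waits on is always already owned by another thread, so the waits cannot deadlock. A thread that finishes its row simply takes the next job, which may belong to a different tile column, as in Figure 2.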

2. Second pass encoding stage

The parallelism in the second pass of the libvpx VP9 encoder is limited by the following factors:
  • Number of tile columns configured for a given resolution. For example, for 1080p resolutions, the maximum number of tile columns possible is 4, limiting the encoder to a maximum of 4 way parallelism.
  • Wastage due to variable thread processing times because of non-uniform tile column size and content variation across the tile columns
The above limitations can be addressed by using a job queue mechanism, as in Figures 1 and 2, where each job corresponds to a tile superblock (SB) row. Top sync is enforced as required for intra and motion vector (MV) prediction.
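
The tile column limit in the first bullet can be reproduced with a short calculation. The sketch below assumes that VP9 tiles are at least four 64-pixel superblocks (256 pixels) wide and that the number of tile columns is a power of two. Neither rule is stated in this post, and the function name `max_tile_columns` is hypothetical.

```python
SB_SIZE = 64           # superblock width in pixels
MIN_TILE_WIDTH_SB = 4  # assumed minimum tile width: 4 superblocks = 256 px


def max_tile_columns(frame_width: int) -> int:
    """Largest power-of-two tile column count allowed for a frame width."""
    sb_cols = (frame_width + SB_SIZE - 1) // SB_SIZE  # width in superblocks
    log2_cols = 0
    # Keep doubling the column count while each tile stays wide enough.
    while (sb_cols >> (log2_cols + 1)) >= MIN_TILE_WIDTH_SB:
        log2_cols += 1
    return 1 << log2_cols


for width in (1920, 2048, 3840, 4096):
    print(width, max_tile_columns(width))
```

Under these assumptions a 1920-pixel-wide frame allows at most 4 tile columns, which is the 4 way parallelism limit mentioned above.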

3. ARNR filtering

In the reference implementation, the filtering process is single threaded. All MBs in the frame are processed in raster scan order. This was multi-threaded using a job queue mechanism similar to the one described above. As the filtering process does not have any spatial dependencies, top sync is not required.
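
Since no row depends on another, the filtering jobs can be handed to a thread pool without any synchronization. The sketch below shows that structure only. The averaging in `filter_row` is a placeholder and not the actual ARNR filter, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_row(frames, row):
    """Placeholder temporal filter: average co-located pixels across frames."""
    count = len(frames)
    width = len(frames[0][row])
    return [sum(frame[row][x] for frame in frames) // count for x in range(width)]


def filter_frame_parallel(frames, num_threads=4):
    """Filter every row as an independent job; no top sync is needed."""
    rows = range(len(frames[0]))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in row order regardless of completion order.
        return list(pool.map(lambda row: filter_row(frames, row), rows))


# Three tiny 2x3 "frames" with constant pixel values 10, 20 and 30.
frames = [[[value] * 3 for _ in range(2)] for value in (10, 20, 30)]
filtered = filter_frame_parallel(frames)
```

In CPython the threads here do not run the arithmetic in parallel, so the sketch illustrates the job layout rather than the speed gain a native implementation would see.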
The row based multi-threading approach discussed above ensures that wastage due to variable thread processing times is minimal. It also improves encoding performance when the number of threads is increased beyond the number of tile columns. The methodology has a negligible impact on BD rate. The changes above were made in libvpx and are available as part of this libvpx Git commit; they were communicated to the developer community here.
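
A toy scheduling model shows why finer jobs reduce idle time. The costs in `ROW_COSTS` are made-up numbers, the `makespan` helper is hypothetical, and the model ignores top sync between rows, so it gives an optimistic picture rather than a measurement.

```python
import heapq


def makespan(job_costs, num_threads):
    """Greedy list scheduling: each job goes to the thread that frees up first."""
    finish_times = [0] * num_threads
    heapq.heapify(finish_times)
    for cost in job_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical cost of each row in 4 tile columns with uneven content.
ROW_COSTS = [
    [3, 3, 3, 3],  # tile column 0: complex content
    [1, 1, 1, 1],  # tile column 1: simple content
    [2, 2, 2, 2],  # tile column 2
    [2, 2, 2, 2],  # tile column 3
]

column_jobs = [sum(rows) for rows in ROW_COSTS]            # one job per tile column
row_jobs = [cost for rows in ROW_COSTS for cost in rows]   # one job per tile row

for threads in (4, 8):
    print(threads, makespan(column_jobs, threads), makespan(row_jobs, threads))
```

With these numbers, column-level jobs take 12 time units on both 4 and 8 threads, because the slowest tile column dominates and the extra threads have nothing to do. Row-level jobs take 8 units on 4 threads and 5 units on 8 threads.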
Table 1 captures the total encoding time reduction achieved in 2 pass encoding mode for different resolutions with row based multi-threading (keeping the computational resources identical).
S.No.  Resolution        Maximum number of tile columns allowed  Threads  Total encoding time reduction (in %)
1      1920*1080 8 bit   4                                       4        29%
2      1920*1080 10 bit  4                                       4        31%
3      2048*1024 10 bit  8                                       8        51%
4      2048*1080 10 bit  8                                       8        40%
5      3840*2160 10 bit  8                                       8        45%
6      4096*2048 10 bit  16                                      16       65%
7      4096*2160 10 bit  16                                      16       52%
Table 1. Encoding time reduction achieved with Threads=Max column tiles
Table 2 captures the total encoding time reduction achieved in 2 pass encoding mode for different resolutions with row based multi-threading (after doubling the computational resources).
S.No.  Resolution        Maximum number of tile columns allowed  Threads  Total encoding time reduction (in %)
1      1920*1080 8 bit   4                                       8        56%
2      1920*1080 10 bit  4                                       8        58%
3      2048*1024 10 bit  8                                       16       65%
4      2048*1080 10 bit  8                                       16       61%
5      3840*2160 10 bit  8                                       16       65%
6      4096*2048 10 bit  16                                      32       75%
7      4096*2160 10 bit  16                                      32       69%
Table 2. Encoding time reduction achieved with Threads=2 * Max column tiles
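
The tables report reductions in encoding time. To read them as speed-up factors, a reduction of r percent corresponds to a factor of 1 / (1 - r/100). The helper below is a small convenience for that conversion; its name is hypothetical.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert an encoding time reduction (in %) into a speed-up factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Smallest and largest reductions reported in Tables 1 and 2.
for reduction in (29, 75):
    print(f"{reduction}% less time = {speedup_from_reduction(reduction):.2f}x faster")
```

For example, the 75% reduction in Table 2 means the encode finishes in a quarter of the original time, which is a 4x speed-up.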
With an improvement of up to 60-70% in turnaround time, the optimized libvpx version significantly lowers the computational cost and turnaround time barriers to the adoption of VP9. Combined with the bandwidth gains over H.264/AVC encoding, the optimized implementation provides an efficient and viable option for encoding HD and UHD/4K streams for online video streaming applications.
A more detailed analysis of the quality and performance of Ittiam’s optimized libvpx implementation was discussed in a paper at the SMPTE Annual Technical Conference 2016. The paper can be accessed at the IEEE online portal, and a brief summary of the paper is available on this blog.