VMAF perceptual video quality metric for content adaptive encoding
February 12, 2018

Perceptual Video Quality Metrics: Are They Ready for the Real World?

by Nithya V S (Lead Engineer, Media Server Technologies – Codecs)

VMAF shows good correlation with MOS and promises to improve further, making it a strong candidate for a consistent perceptual video quality metric for content adaptive encoding

The fundamental objective of most Content Adaptive Encoding (CAE) solutions is to optimize bandwidth without significantly reducing perceptual quality. While traditional fidelity metrics such as PSNR are well accepted as reliable, objective indicators of signal fidelity, they are not good indicators of perceptual quality – and hence do not work well for CAE. The golden reference – DMOS – is perhaps the best indicator of perceptual quality, since it is derived from ratings by trained human viewers. For that very reason, however, it cannot be automated and is unviable for CAE solutions.

To fill the need for reliable metrics in CAE solutions, content creators are increasingly exploring perceptual video quality metrics. These metrics – including PSNR-HVS-M, VQM-VFD, VMAF and ST-RRED – provide an objective measure of how a viewer would perceive a video. Among them, VMAF and the ST-RRED index have shown the highest correlation with DMOS on several standard datasets.

Assessing VMAF

While identifying the right quality metric for THINKode – Ittiam’s machine learning (ML) based CAE solution – we benchmarked a few of these metrics, including VMAF, to assess their consistency on real-world content.

VMAF

  • Predicts the subjective video quality based on a reference and distorted video sequence.
  • The higher the VMAF score, the better the perceptual quality.
  • Developed by Netflix and emerging as one of the more popular metrics in the industry.
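
In practice, VMAF scores can be computed with Netflix’s open-source vmaf tool or FFmpeg’s libvmaf filter. Below is a minimal sketch using the latter from Python – the file names are placeholders, and it assumes an FFmpeg build with libvmaf enabled:

```python
# Minimal sketch: scoring a distorted/reference pair with FFmpeg's libvmaf
# filter. File names are placeholders; assumes FFmpeg was built with libvmaf.
import subprocess

def run_vmaf(distorted: str, reference: str, log_path: str = "vmaf.json") -> None:
    """Write per-frame and pooled VMAF scores to a JSON log."""
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,   # distorted first, reference second
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",                            # discard decoded output
    ], check=True)

run_vmaf("encoded_500kbps.mp4", "source.mp4")
```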


Our objective was to assess the ability of these metrics to distinguish between the quality of two different encoded video streams – one at good quality and the other at average quality. The difference between the metrics’ values needed to be consistent across the test set in quantifying the extra perceptual distortion. To find out, we did the following:

  • Identified a diverse set of video sequences to construct our own test set for the product
  • Generated bitstreams at bitrates ranging from 500 Kbps to 15 Mbps to cover the span of MOS scores from 1 to 4 (a sketch of this step follows the list below)


Mean Opinion Score (MOS)
is expressed as a single rational number, in the range of 1–5, where 1 is the lowest perceived quality, and 5 is the highest perceived quality.

  • Generated MOS for the outputs (through subjective DSIS assessment tests as specified in Recommendation ITU-R BT.500-13) to act as a reference measure – including 0.5-step values to facilitate better comparison with the finer-grained VMAF scores
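
For illustration, here is a minimal sketch of the bitstream-generation step, assuming FFmpeg with libx264 is available; the source file name and exact ladder rungs are stand-ins, not our actual test configuration:

```python
# Hypothetical ladder generation with FFmpeg/x264; the source name and rungs
# are illustrative, not our actual test configuration.
import subprocess

SOURCE = "source.y4m"
LADDER_KBPS = [500, 1000, 2000, 4000, 8000, 15000]   # spans 500 Kbps to 15 Mbps

for kbps in LADDER_KBPS:
    subprocess.run([
        "ffmpeg", "-y", "-i", SOURCE,
        "-c:v", "libx264", "-b:v", f"{kbps}k",
        "-maxrate", f"{kbps}k", "-bufsize", f"{2 * kbps}k",  # constrain the rate
        f"encoded_{kbps}kbps.mp4",
    ], check=True)
```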

The reference MOS values are represented in Figure 1 below.

Figure 1: MOS across bitrates for the test content

So how did VMAF fare?

For the vast majority of our content, VMAF correlated consistently with the corresponding MOS. VMAF is tuned to be linear in the MOS range of 2 to 3.5, which is a sweet spot for OTT content delivery. As seen in Figure 2, the cluster of VMAF scores drops almost linearly with the drop in MOS.


Figure 2: VMAF scores against corresponding MOS
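
For reference, the degree of linearity visible in Figure 2 can be quantified with Pearson (linearity) and Spearman (rank order) correlation coefficients; the sketch below uses SciPy with illustrative stand-in values, not our measured data:

```python
# Hedged sketch: quantifying VMAF-MOS agreement with SciPy. The arrays are
# illustrative stand-ins, not our measured data.
from scipy.stats import pearsonr, spearmanr

mos = [3.5, 3.0, 2.5, 2.0]          # subjective scores for one sequence
vmaf = [95.8, 89.2, 82.0, 75.3]     # corresponding VMAF scores (hypothetical)

plcc, _ = pearsonr(mos, vmaf)       # linearity of the relationship
srocc, _ = spearmanr(mos, vmaf)     # monotonicity (rank order)
print(f"PLCC = {plcc:.3f}, SROCC = {srocc:.3f}")
```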

VMAF Performance: Observed Inconsistencies

Figure 2 represents the data for the ‘vast majority’ and hence gives an average indication of VMAF’s performance. However, we also wanted to test VMAF with regard to spatial pooling and high levels of grain noise. In the process, we encountered a few outlier sequences that did not produce the desired correlation.

Sequence 1: A sequence with a low level of grain noise, used as a reference for good correlation.

Sequence 2: Static background with action only in a small percentage of each frame – like logos, text, etc. VMAF was found to have a spatial pooling problem in such cases (illustrated in the sketch after these descriptions).

Sequence 3: High level of grain noise. We observed that VMAF tends to overestimate the MOS in such sequences.
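
To see why spatial pooling matters for content like sequence 2, consider a toy example: if per-region quality is aggregated as a plain mean over the frame, severe distortion confined to a small region barely moves the aggregate. The sketch below uses hypothetical per-block scores and is not VMAF’s actual pooling scheme:

```python
# Illustrative only: not VMAF's actual pooling, just a toy example of why
# averaging over a frame can mask distortion confined to a small region.
import numpy as np

# Hypothetical per-block quality scores for one frame: a pristine static
# background with severe distortion limited to a small ticker/logo region.
blocks = np.full(100, 99.0)   # 100 blocks of near-perfect static background
blocks[:10] = 40.0            # 10% of the frame is badly distorted

print(f"mean pooling:        {blocks.mean():.1f}")             # ~93.1, looks fine
print(f"5th-percentile pool: {np.percentile(blocks, 5):.1f}")  # 40.0, exposes it
```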

As seen in Figure 3 below, VMAF does well with sequence 1, as its scores drop in line with the MOS. The other hand-picked sequences, however, exhibit lower linearity due to the spatial pooling problem and MOS overestimation described for sequences 2 and 3 above.


Figure 3: VMAF scores against MOS for outlier sequences

Let us compare the extremes in correlation among these sequences – sequences 1, 2 and 3.

Metric   Value Type   Sequence 1   Sequence 2   Sequence 3
MOS      Max          3.5          3.5          3.5
         Min          2            2            2
         Drop %       42.9%        42.9%        42.9%
VMAF     Max          95.8         99.5         91.1
         Min          75.3         97.9         84.2
         Drop %       21.4%        1.6%         7.5%
Table 1: VMAF against MOS – variations across content

As Table 1 shows, for identical drops in MOS (42.9%), the corresponding drops in VMAF for sequences 1, 2 and 3 are 21.4%, 1.6% and 7.5% respectively.
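
The Drop % rows in Table 1 follow from drop = (max − min) / max, which the MOS rows confirm: (3.5 − 2) / 3.5 = 42.9%. A quick check of the VMAF rows:

```python
# Sanity check of the Drop % rows in Table 1, assuming drop = (max - min) / max
# (the formula the MOS rows follow).
for seq, vmax, vmin in [("Sequence 1", 95.8, 75.3),
                        ("Sequence 2", 99.5, 97.9),
                        ("Sequence 3", 91.1, 84.2)]:
    print(f"{seq}: {(vmax - vmin) / vmax:.1%}")
# Prints 21.4%, 1.6% and 7.6% - matching Table 1 within rounding.
```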

VMAF: A Strong Basis for Content Adaptive Encoding

We were not surprised to observe that VMAF works well ‘most of the time’ and can be a strong basis for CAE solutions. We also gained a clear understanding of the type of content where it does not operate with the same degree of consistency.

We also noticed that, across our test content, VMAF failed to provide an acceptable degree of correlation for only about 15% of the sequences – making it 85% consistent. However, since 85% accuracy is typically insufficient, CAE solutions have to fill these gaps through additional content analysis to characterize such behavior, or by leveraging backup QA steps.

The bottom line: there is scope for improvement, and these relatively new metrics are poised to get better and achieve even higher levels of consistency.

Explore THINKode – Our ML based CAE solution

Contact us at mkt@www.ittiam.com