Perceptual Video Quality Metrics: Are they Ready for the Real World?
by Nithya V S (Lead Engineer, Media Server Technologies – Codecs)
VMAF shows good correlation with MOS, promising to get better in future to serve as a consistent perceptual video quality metric for content adaptive encoding
The fundamental objective of most Content Adaptive Encoding (CAE) solutions is to optimize bandwidth without significantly reducing perceptual quality. While traditional fidelity metrics such as PSNR are well accepted as reliable, objective indicators of video quality, they are not good indicators of perceptual quality, and hence do not work well with CAE. The golden reference, DMOS (Differential Mean Opinion Score), is perhaps the best indicator of human perceptual quality since it is rated by trained humans. For that very reason, it cannot be automated and hence is unviable for CAE solutions.
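To make the distinction concrete, here is a minimal sketch of how PSNR is computed from the mean squared error between a reference and a distorted signal. The function name `psnr` and the sample pixel values are illustrative assumptions, not part of our benchmark, and real tools operate on full frames rather than short lists.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Return PSNR in dB for two equally sized sequences of pixel values.

    `peak` is the maximum possible pixel value (255 for 8-bit video).
    """
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error: every pixel error is weighted identically.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 8-bit pixel values, for illustration only.
ref = [100, 100, 100, 100]
dist = [101, 99, 101, 99]
print(f"PSNR: {psnr(ref, dist):.2f} dB")
```

Because the error is averaged uniformly over all pixels, the result says nothing about where the distortion sits or how visible it is, which is why a fidelity metric of this kind can diverge from what a viewer perceives.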
To fill the need for reliable metrics for CAE solutions, content creators are increasingly exploring several perceptual video quality metrics. These metrics, including PSNR-HVSM, VQM-VFD, VMAF and ST-RRED, provide an objective measure of how a user would perceive a video. Among them, VMAF and ST-RRED have the highest correlation with DMOS on several standard datasets.
During the course of identifying the right quality metric within THINKode – Ittiam’s machine learning (ML) based CAE solution – we benchmarked a few of the metrics, including VMAF, to assess their consistency with real world content.
Our objective was to assess the ability of these metrics to distinguish between the quality of two different encoded video streams – one at good quality and the other at average quality. The difference between the metrics’ values needed to be consistent across the test set in quantifying the extra perceptual distortion. To find out, we did the following:
Mean Opinion Score (MOS) is expressed as a single rational number, in the range of 1–5, where 1 is the lowest perceived quality, and 5 is the highest perceived quality.
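As a minimal sketch of that definition, a MOS can be obtained by averaging individual viewer ratings on the 1 to 5 scale. The helper name `mean_opinion_score` and the sample ratings are hypothetical; the rating procedure itself is not detailed here.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of individual ratings, each on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1 to 5")
    return mean(ratings)


# Hypothetical ratings from four viewers for one encoded clip.
print(mean_opinion_score([3, 4, 4, 3]))
```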
The reference MOS values are represented in Figure 1 below.
For a vast majority of our content, we recorded a consistent correlation of VMAF with its respective MOS. VMAF is tuned to be linear in the MOS range of 2 to 3.5, which is a sweet spot for OTT content delivery. As seen in Figure 2, the cluster of VMAF scores drops almost linearly with a drop in MOS.
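One way to quantify such a relationship is with correlation coefficients. The sketch below computes the Pearson (linear) and Spearman (rank) coefficients, both commonly used when evaluating quality metrics against subjective scores. Which coefficient was used for our benchmark is not stated here, so treat this as an assumption, and note that the MOS and VMAF values in the example are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical MOS / VMAF pairs, for illustration only.
mos = [2.0, 2.5, 3.0, 3.5]
vmaf = [40.0, 55.0, 70.0, 85.0]
print(pearson(mos, vmaf), spearman(mos, vmaf))
```

Pearson rewards a linear relationship, while Spearman only checks that the ordering is preserved, so reporting both helps separate a metric that ranks content correctly from one that also scales linearly with MOS.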
Figure 2 represents the data for the ‘vast majority’ and hence gives an average indication of the performance of VMAF. However, we wanted to test VMAF performance with regard to spatial pooling and high grain noise. In the process, we encountered a few specific outlier sequences that did not produce the desired correlation.
Sequence 1: A sequence with low level of grain noise, used as a reference for good correlation
Sequence 2: Static background with action only in a small percentage of each frame – like logos, text, etc. VMAF was found to have a spatial pooling problem in such cases.
Sequence 3: High level of grain noise. We observed that VMAF tends to overestimate the MOS in such sequences.
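To picture the kind of spatial pooling problem described for Sequence 2, consider the toy example below. It is not how VMAF computes its score internally; it only illustrates, under assumed per-block quality scores, how averaging over a whole frame can hide distortion confined to a small region. The function names, block scores and the 5% fraction are all hypothetical.

```python
def mean_pool(block_scores):
    """Pool per-block quality scores by simple averaging."""
    return sum(block_scores) / len(block_scores)


def worst_fraction_pool(block_scores, fraction=0.05):
    """Average only the lowest-scoring `fraction` of blocks (at least one)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(block_scores) * fraction))
    return mean_pool(sorted(block_scores)[:k])


# Hypothetical frame: 95 static background blocks that encode perfectly
# and 5 active blocks (logo, text) that are visibly distorted.
blocks = [100.0] * 95 + [40.0] * 5
print(mean_pool(blocks))            # 97.0
print(worst_fraction_pool(blocks))  # 40.0
```

The frame-wide average stays high even though the only region a viewer is likely to watch is badly degraded, which is one plausible reading of why content with a static background and a small active area can be scored too generously.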
As seen in Figure 3 below, VMAF does well with sequence 1 as the scores drop in line with the MOS. However, the other hand-picked sequences exhibit lower linearity due to the spatial pooling problems and overestimation of MOS, as indicated in the description of sequences 2 to 6 above.
Let us compare the two extremes in terms of correlation within the set of six sequences – sequences 1, 2 and 5.
| Metric | Value Type | Sequence 1 | Sequence 2 | Sequence 3 |
As Table 1 shows, for identical drops in MOS, the gaps in VMAF for sequences 1, 3 and 5 are 21.4%, 1.6% and 7.5% respectively.
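A gap of this kind can be computed as sketched below. How the percentages above were normalized is not stated, so the sketch assumes the gap is the drop in VMAF expressed relative to the good-quality encode's score. The function name and the example scores are hypothetical.

```python
def relative_gap_percent(score_good, score_average):
    """Drop from the good-quality score to the average-quality score,
    expressed as a percentage of the good-quality score (an assumption)."""
    if score_good <= 0:
        raise ValueError("good-quality score must be positive")
    return 100.0 * (score_good - score_average) / score_good


# Hypothetical VMAF scores for one sequence at two quality levels.
print(relative_gap_percent(90.0, 72.0))
```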
We were not surprised to observe that VMAF works well ‘most of the time’ and can be a strong basis for CAE solutions. We also gained a clear understanding of the type of content where it does not operate with the same degree of consistency.
Across our test content, VMAF failed to provide an acceptable degree of correlation for only 15% of the content, making it consistent for the remaining 85%. However, since 85% accuracy is typically insufficient, CAE solutions have to fill these gaps through additional content analysis to characterize such behavior, or by leveraging additional backup QA steps.
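A consistency figure of this kind could be derived as in the sketch below, which flags sequences whose VMAF gap falls outside an expected band for a given MOS drop and reports the share that remains. What counts as an "acceptable degree of correlation" is not defined here, so the band limits, clip names and gap values are all hypothetical.

```python
def flag_inconsistent(gaps_by_sequence, min_gap, max_gap):
    """Names of sequences whose VMAF gap lies outside [min_gap, max_gap]."""
    return sorted(
        name
        for name, gap in gaps_by_sequence.items()
        if not min_gap <= gap <= max_gap
    )


def consistency_percent(gaps_by_sequence, min_gap, max_gap):
    """Percentage of sequences whose gap lies inside the expected band."""
    if not gaps_by_sequence:
        raise ValueError("no sequences to evaluate")
    flagged = flag_inconsistent(gaps_by_sequence, min_gap, max_gap)
    total = len(gaps_by_sequence)
    return 100.0 * (total - len(flagged)) / total


# Hypothetical VMAF gaps (in %) for the same MOS drop on four clips.
gaps = {"clip_a": 21.0, "clip_b": 2.0, "clip_c": 19.5, "clip_d": 18.0}
print(flag_inconsistent(gaps, min_gap=15.0, max_gap=25.0))
print(consistency_percent(gaps, min_gap=15.0, max_gap=25.0))
```

In a CAE pipeline, the flagged clips are the ones that would be routed to additional content analysis or a backup QA step rather than trusting the metric alone.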
The bottom line is there is scope for improvement, with these relatively new metrics poised to get better in future and achieve even higher levels of consistency.