Identifying Pitfalls in the Evaluation of Saliency Models for Videos

Abstract—Saliency prediction has been extensively studied for natural images. In the areas of video coding and video quality assessment, researchers integrate saliency models by applying them to individual frames of a video sequence. When selecting the best-performing saliency models for these applications, the evaluation considers only the average model performance over all frames of a video. This may mask the defects of a saliency model and consequently hinder its further improvement. In this paper, we identify pitfalls in the evaluation of saliency models for videos, and demonstrate the importance of considering video content classification and temporal effects. Building on these findings, we make recommendations for saliency model evaluation and selection for videos.


I. INTRODUCTION
The past few decades have witnessed significant growth in the use of digital videos in our daily lives. Videos are inevitably subject to distortions introduced by compression and transmission. These distortions reduce video quality, which affects observers' visual experience and task performance [1]. To control, monitor and improve the quality of digital videos, a great deal of attention has been paid to the development of advanced algorithms for video compression and video quality assessment.
A current research trend in video compression and video quality assessment is to consider visual attention, a powerful feature of the human visual system (HVS) [2], [3]. The visual attention mechanism enables the HVS to select the most relevant information from the visual scene. Simulating selective attention is highly beneficial for computational algorithms, allowing them to distinguish between relevant and irrelevant visual signals and adaptively determine their parameters and processes. Many saliency models are available in the literature [4]. These models predict visual attention by generating a so-called saliency map, which represents the conspicuousness of scene locations and reflects the relative importance of different image regions [5]. Saliency models are incorporated into video algorithms to produce saliency maps for individual frames (with or without temporal feature adaptation) [3], [6]. The frame-level saliency map can be used to weight the local algorithm output; for example, a distortion map calculated by a visual quality metric is weighted by a saliency map to generate a quality score for each frame, which is then averaged over all frames to produce a sequence-level quality score [7]. The effectiveness of these saliency-based video algorithms largely depends on the accuracy of the saliency model used [3].
Saliency model evaluation has been extensively studied for still images, where a saliency similarity score is computed between the predicted saliency and the ground truth [8]. However, the evaluation of saliency models for videos is less studied. The current practice is to aggregate the frame-based evaluations into a single score that represents the model accuracy for the entire video sequence. This evaluation regime neglects the temporal variation of saliency prediction accuracy and the impact of content classification on the overall performance of a saliency model. In this paper, based on the ground-truth eye-tracking data for videos of [3], we perform statistical analyses to reveal the performance of state-of-the-art saliency models and identify pitfalls in the evaluation of saliency models for videos. The findings can help build a reliable benchmark of saliency models for videos.

II. EXPERIMENTAL SETUP

A. Eye-tracking data
The SVQ160 database [3] represents a reliable eye-tracking study, in which the data collection implemented rigorous control mechanisms to eliminate experimental biases. Note that the stimuli of the SVQ160 database contain both pristine and distorted videos. In this study, we only use the ten pristine videos with the aim of making the analyses more generally applicable, as the saliency of distorted videos is exclusively relevant to video compression and quality assessment. The reference videos cover a diverse range of video content, as shown in Fig. 1. The videos are about ten seconds long and have a resolution of 768 × 432 pixels. Eye movements of 20 observers were collected for each video. A frame-level saliency map (FSM) is generated from the fixations of all subjects, with each fixation giving rise to a Gaussian kernel that simulates the foveal vision (2° visual angle) of the HVS [3]. Fig. 1 illustrates examples of frame-level saliency maps.
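For illustration, the FSM construction described above can be sketched as follows. This is a minimal reconstruction, not the exact pipeline of [3]: the Gaussian width corresponding to 2° of visual angle depends on the viewing distance and display, so sigma_px = 45 is only an assumed, illustrative value, and the function name is ours.

```python
import numpy as np

def fixations_to_fsm(fixations, height=432, width=768, sigma_px=45):
    """Sketch: place a Gaussian kernel at each fixation, then normalize.

    fixations: iterable of (x, y) pixel coordinates pooled over all observers.
    sigma_px:  assumed pixel equivalent of 2 degrees of visual angle.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    fsm = np.zeros((height, width), dtype=np.float64)
    for fx, fy in fixations:
        fsm += np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2.0 * sigma_px ** 2))
    return fsm / fsm.sum()  # normalize to a probability distribution
```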

B. Saliency models and performance measures
We selected a total of 14 state-of-the-art saliency models, including seven traditional models and seven deep learning-based models. Table I lists these models and gives a brief description of each.
A saliency model can be applied to each frame of a video sequence to generate a predicted frame-level saliency map (FSM). The prediction accuracy can be quantified by a similarity measure between the predicted FSM (predFSM) and the ground truth FSM (gtFSM). Amongst the popular saliency similarity measures [8], we adopt the following two.

[Fig. 2: Time-aggregated performance of the saliency models listed in Table I, measured by CC and SIM. Error bars indicate a 95% confidence interval.]
Similarity (SIM): SIM measures the similarity between the predicted and ground truth saliency maps when viewed as distributions (equivalent to histogram intersection):

SIM = Σ_i min(predFSM_i, gtFSM_i),        (1)

where i represents the bin index of the histogram of a saliency map and both maps are normalized to sum to one. The SIM value ranges between 0 and 1; the higher the SIM value, the more accurate the saliency prediction.
Pearson Linear Correlation Coefficient (CC): CC measures the linear correlation between the predicted saliency map predFSM and the ground truth saliency map gtFSM:

CC = cov(predFSM, gtFSM) / (σ_predFSM × σ_gtFSM),        (2)

where σ_predFSM and σ_gtFSM denote the standard deviations of predFSM and gtFSM, respectively, and cov(predFSM, gtFSM) denotes the covariance of the two saliency maps. CC ranges between -1 and 1. A CC close to -1 or 1 means the two maps are highly correlated, i.e., the saliency prediction is accurate; the closer CC is to 0, the less correlated the two maps and the less accurate the prediction.
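For concreteness, equations (1) and (2) can be computed as in the minimal sketch below, assuming both maps are given as non-negative NumPy arrays of equal shape (the function names are ours):

```python
import numpy as np

def sim(pred_fsm, gt_fsm, eps=1e-12):
    """Equation (1): histogram intersection of the two maps as distributions."""
    p = pred_fsm / (pred_fsm.sum() + eps)
    g = gt_fsm / (gt_fsm.sum() + eps)
    return float(np.minimum(p, g).sum())  # in [0, 1], higher is better

def cc(pred_fsm, gt_fsm):
    """Equation (2): Pearson linear correlation between the two maps."""
    return float(np.corrcoef(pred_fsm.ravel(), gt_fsm.ravel())[0, 1])  # in [-1, 1]
```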

C. Proposed saliency analysis methods
The goal of this paper is to show how saliency models behave in video applications. This can help identify appropriate metrics for saliency model evaluation and guide the selection of saliency models for videos. The existing saliency evaluation method aggregates frame-level saliency accuracy over time and produces a single score to represent the prediction accuracy of a saliency model. This method ignores important properties of video saliency, including biases caused by visual content and temporal effects. To capture these properties for saliency evaluation, we consider the following methods.

Content-driven saliency dispersion (CSD): CSD [6] provides a quantitative measure of the degree of spatial saliency dispersion driven by visual content. Gaze is concentrated in fewer places in visual content with highly salient features than in content lacking salient features. Given a frame-level saliency map (FSM), CSD can be quantified by applying Shannon entropy to its non-overlapping blocks:

CSD = (1 / N_max) Σ_{k=1..N_max} H(B_k),        (3)

where H(B_k) represents the entropy of block B_k, P_max refers to the level of segmentation (P_max = 4 was determined empirically in [6]), and the map is divided into a P_max × P_max grid, giving N_max = P_max² non-overlapping blocks. The lower the CSD, the more concentrated the saliency; the higher the CSD, the more dispersed the saliency in the spatial domain.
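A sketch of the CSD computation of equation (3) is given below. Note that the entropy estimator is not fully specified in our rendering above, so the histogram-based block entropy (with an assumed bin count of 64) is one plausible reading of [6], not a verified reimplementation:

```python
import numpy as np

def csd(fsm, p_max=4, n_bins=64, eps=1e-12):
    """Equation (3): mean Shannon entropy over a P_max x P_max block grid."""
    h, w = fsm.shape
    bh, bw = h // p_max, w // p_max
    entropies = []
    for r in range(p_max):
        for c in range(p_max):
            block = fsm[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            hist, _ = np.histogram(block, bins=n_bins)  # assumed estimator
            prob = hist / (hist.sum() + eps)
            prob = prob[prob > 0]
            entropies.append(float(-(prob * np.log2(prob)).sum()))
    return float(np.mean(entropies))  # lower = more concentrated saliency
```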
Temporal saliency prediction outliers (TSO): Temporal fluctuation of saliency prediction can significantly affect its application in video processing algorithms. A good saliency model should not only provide high time-aggregated accuracy, but also maintain its performance over time for a video sequence. To measure the temporal consistency of saliency prediction, TSO measures the ratio of outlier frames (N_of) to the total number of frames (N_af) of a video sequence. The outlier frames are the frames with a saliency prediction score (i.e., CC) below the threshold CC_mean − t × CC_se, where CC_mean and CC_se denote the mean and standard error of CC over all frames of the video (t = 6 was determined empirically in our experiment). TSO is defined as:

TSO = N_of / N_af.        (4)

The lower the TSO value, the better the saliency prediction consistency over time for the video.
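A minimal sketch of equation (4), assuming the per-frame CC scores of a video are available as a 1-D array:

```python
import numpy as np

def tso(frame_cc, t=6):
    """Equation (4): ratio of outlier frames N_of to all frames N_af."""
    frame_cc = np.asarray(frame_cc, dtype=np.float64)
    n_af = frame_cc.size
    cc_se = frame_cc.std(ddof=1) / np.sqrt(n_af)  # standard error of CC
    threshold = frame_cc.mean() - t * cc_se
    n_of = int((frame_cc < threshold).sum())
    return n_of / n_af  # lower = better temporal consistency
```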

III. EXPERIMENTAL RESULTS

A. Time-aggregated model performance
First, we use the conventional method to evaluate saliency models for videos. For each video, the saliency performance measure (i.e., SIM or CC) is calculated for each frame and averaged over all frames to produce a single performance value. Fig. 2 shows the time-aggregated performance of the 14 saliency models described in Section II. It can be seen that the deep learning-based models outperform the traditional models, with FES being the only traditional model that is competitive with the deep learning-based ones. All deep learning-based models are better than the baseline model, which is defined by "stretching a symmetric Gaussian to fit the aspect ratio of a given image, under the assumption that the center of the image is most salient" [4].

Now, we challenge this conventional saliency evaluation by identifying saliency model behaviours that it masks. To reduce biases in our further investigation, we only retain the models that rank above the baseline in either of the rankings in Fig. 2: UNISAL, VGG, FES, EML, FastSal, MSI, GazeGAN, ResNet and GBVS. Note that, for consistency, CC is used as the saliency evaluation measure in the following analysis.
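For reference, this conventional time-aggregated evaluation reduces to averaging a frame-level measure over a sequence. The sketch below assumes model is a callable mapping a frame to a predicted FSM and measure is one of the functions from Section II (e.g., cc or sim); both assumptions are ours:

```python
import numpy as np

def time_aggregated_score(model, frames, gt_fsms, measure):
    """Average a frame-level measure over all frames of one video."""
    scores = [measure(model(frame), gt) for frame, gt in zip(frames, gt_fsms)]
    return float(np.mean(scores))  # single score for the whole sequence
```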

B. Impact of video content on saliency model performance
Hypothesis: We hypothesize that the impact of video content (VC) on the performance of saliency prediction models is statistically significant.
We first define the video content (VC) variable as a classification based on the content-driven saliency dispersion (CSD) measure of equation (3). We calculate the sequence-level CSD by averaging the frame-level CSD values. Fig. 3(a) shows the saliency dispersion degree for all videos. Based on the observed CSD values, we classify the videos into two groups: VC_dispersed (comprising 'rh', 'rb', 'sh', 'bs', 'mc', 'tr' and 'pa') and VC_compact (comprising 'pr', 'st' and 'sf'). To verify that this content grouping is statistically meaningful, we perform hypothesis testing with CSD as the dependent variable and the categorical VC group as the independent variable. The Mann-Whitney U test is performed (due to evidence of non-normality per the Shapiro-Wilk test) [23], and the results (P < 0.05) show that the CSD of group VC_dispersed is statistically significantly higher than that of group VC_compact, as shown in Fig. 3(b).

Now, for the two distinct video content classes (i.e., VC_dispersed and VC_compact), we analyse the impact of video content on the performance of saliency prediction models. For each video, based on the frame-level CC of equation (2), we compute a sequence-level CC by averaging CC values over all frames; the 14 saliency models thus yield 14 sequence-level CC values per video. Hypothesis testing is conducted with sequence-level CC as the dependent variable and the categorical VC group as the independent variable. The Mann-Whitney U test is performed (due to evidence of non-normality per the Shapiro-Wilk test), and the results (P < 0.05) show that model performance on group VC_dispersed is statistically significantly lower than on group VC_compact, as shown in Fig. 4(a). The top-performing saliency models tend to capture the saliency of VC_compact videos, although there is still room for improvement, as the sequence-level CC = 0.48 remains inadequate. However, these models fail in predicting the saliency of VC_dispersed videos (a sequence-level CC = 0.26 indicates poor accuracy, as shown in Fig. 4(a)). The evidence implies that, for videos, predicting the saliency of complex scenes (as indicated by the VC_dispersed class, e.g., multiple objects) is more challenging than that of simple scenes (as indicated by the VC_compact class, e.g., a single object with dominant motion).
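The testing procedure can be reproduced with SciPy. The sketch below assumes the sequence-level scores of the two VC groups have been collected into two arrays (the normality check motivating the non-parametric test is included); the function name is ours:

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro

def compare_vc_groups(dispersed_scores, compact_scores):
    """Shapiro-Wilk normality check, then a two-sided Mann-Whitney U test."""
    for name, scores in (("VC_dispersed", dispersed_scores),
                         ("VC_compact", compact_scores)):
        _, p_norm = shapiro(scores)
        print(f"{name}: Shapiro-Wilk p = {p_norm:.4f}")  # p < 0.05 -> non-normal
    u_stat, p_val = mannwhitneyu(dispersed_scores, compact_scores,
                                 alternative="two-sided")
    return u_stat, p_val  # significant group difference when p_val < 0.05
```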
In addition, we analyse how the individual saliency models respond to VC_dispersed and VC_compact videos. The Mann-Whitney U test is performed (due to evidence of non-normality per the Shapiro-Wilk test) on the sequence-level CC values produced by each model, and the results (P < 0.05) show that, for every model, the performance on VC_compact is statistically significantly higher than on VC_dispersed, as shown in Fig. 4(b). For the VC_compact videos the model performance varies, e.g., UNISAL, VGG, FES and EML give good predictions with CC larger than 0.5, whereas all models consistently fail in predicting the saliency of the VC_dispersed videos. This suggests that, without considering the impact of video content, poor model performance could be masked depending on the test database. In summary, the significant difference in model behaviour across video content classes (i.e., simple or complex scenes) deserves more attention in both the evaluation of saliency models for videos and the construction of video eye-tracking databases, so that these biases can be accounted for in further research.
C. Impact of temporal context on saliency model performance

Hypothesis: We hypothesize that the impact of temporal context on the performance of saliency prediction models is statistically significant.
Little attention has been paid to the temporal variation of saliency model performance for videos. First, we measure the temporal consistency using TSO (temporal saliency prediction outliers) of equation (4). The TSO values of the nine top-performing saliency models are listed in Table II. In contrast to the time-aggregated performance values of Fig. 2, models that rank high in the time-aggregated rankings do not necessarily exhibit high performance consistency over time; e.g., UNISAL ranks 1st in the time-aggregated rankings but only 5th in the temporal-consistency rankings. The temporal consistency of saliency prediction is critical for applications such as video quality assessment and compression, where frame-based saliency weighting is often used [3].
Moreover, to analyse model behaviour in the temporal context, we divide a video into ten consecutive time blocks (i.e., each time block (TB) represents one second). For each time block, the mean of the frame-level CC values (see equation (2)) over all (i.e., ten) videos and all (i.e., nine) saliency prediction models is computed. Fig. 5(a) shows the model performance in time order (TO), indicating that the model prediction accuracy fluctuates over time (e.g., a dip occurs around the 3rd to 5th time blocks). Based on the observations of Fig. 5(a), we classify the time order (TO) into three semantic categories: TO_1 (time blocks 1-2), TO_2 (time blocks 3-5), and TO_3 (time blocks 6-10). Hypothesis testing is conducted with frame-level CC as the dependent variable and the categorical TO as the independent variable. Pair-wise comparison is performed using the Mann-Whitney U test (due to evidence of non-normality according to the Shapiro-Wilk test). The results (P < 0.05) show that the difference in model performance between any two TO groups is statistically significant, as shown in Fig. 5(b).

The evidence indicates that the prediction accuracy of saliency models significantly deteriorates towards the middle section of a video sequence. A plausible reason could be that there is much uncertainty around the middle of viewing. At the beginning of the scene (i.e., TO_1), observers predominantly focus on salient regions; as time evolves (i.e., TO_2), observers' viewing behaviour may change and gaze may shift away from salient regions due to the tendency to explore non-salient regions in the scene; after this exploration, observers move their gaze back to the salient regions (i.e., TO_3). This account of observers' viewing behaviour could explain the extremely poor saliency model performance (i.e., CC = 0.27) for the middle section of a video sequence, meaning that existing models cannot handle complex saliency shifts. This temporal gaze behaviour poses challenges for accurately predicting saliency for videos. One way to address this problem is to incorporate scene-understanding components into saliency models, which is worth further investigation.
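The time-block pooling above can be sketched as follows, assuming each video contributes a 1-D array of frame-level CC values covering roughly ten seconds of frames; splitting into equal chunks is our simplification of the one-second blocking:

```python
import numpy as np

def time_block_means(per_video_cc, n_blocks=10):
    """Mean frame-level CC per time block (TB), pooled over videos/models."""
    block_scores = [[] for _ in range(n_blocks)]
    for frame_cc in per_video_cc:  # one array of CC values per video (and model)
        chunks = np.array_split(np.asarray(frame_cc, dtype=np.float64), n_blocks)
        for tb, chunk in enumerate(chunks):
            block_scores[tb].extend(chunk.tolist())
    return [float(np.mean(s)) for s in block_scores]  # one mean per TB
```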

IV. CONCLUSION
In this paper, we formulate a new problem of saliency prediction: how to rigorously evaluate computational saliency models for videos. We found that video content has a significant impact on the performance of saliency models, and that existing models fail in predicting the saliency of videos of complex scenes. The impact of temporal context on saliency model performance is also significant, and existing models fail to capture saliency in the middle section of a video. These findings can be used to facilitate the benchmarking and selection of saliency models for video applications.