Knowledge-driven perceptual organization reshapes information sampling via eye movements

Humans constantly move their eyes to explore the environment and obtain information. Competing theories of gaze guidance consider the factors driving eye movements within a dichotomy between low-level visual features and high-level object representations. However, recent developments in object perception indicate a complex and intricate relationship between features and objects. Specifically, image-independent object-knowledge can generate objecthood by dynamically reconfiguring how feature space is carved up by the visual system. Here, we adopt this emerging perspective on object perception, moving away from the simplifying dichotomy between features and objects in explanations of gaze guidance. We recorded eye movements in response to stimuli that appear as meaningless patches on initial viewing but are experienced as coherent objects once relevant object-knowledge has been acquired. We demonstrate that gaze guidance differs substantially depending on whether observers experienced the same stimuli as meaningless patches or organized them into object representations. In particular, fixations on identical images became object-centred, less dispersed, and more consistent across observers once observers had been exposed to relevant prior object-knowledge. Observers’ gaze behaviour also indicated a shift from exploratory information-sampling to a strategy of extracting information mainly from selected, object-related image areas. These effects were evident from the first fixations on the image. Importantly, however, eye movements were not fully determined by object representations but were best explained by a simple model that integrates image-computable features and high-level, knowledge-dependent object representations. Overall, the results show how information sampling via eye movements in humans is guided by a dynamic interaction between image-computable features and knowledge-driven perceptual organization.


Introduction
More generally, a limitation of many previous studies that aim to show the contribution of objects or semantic meaning to oculomotor control is their reliance on a comparison to computational models that calculate image-computable feature maps as their null instance. For example, the conclusion that objects per se, rather than image-computable features that are correlated with objects, guide human eye movements, as suggested by Einhäuser and colleagues (2008), was based on a comparison between manually labelled object maps and a map derived from one of the earliest saliency models (Itti & Koch, 2000). A re-analysis of the data compared the object maps to other low-level models, including the AWS model, which matched or exceeded the object maps' performance (Borji et al., 2013). This finding thus led to a reversal of the original conclusion. When a very similar dataset was analysed for a third time (Stoll et al., 2015), a more sophisticated object map showed higher performance than even AWS. Note, however, that since the publication of these studies, saliency models that outperform AWS by a large margin have been developed.

Independently of the favoured interpretation of these findings, there is a more fundamental aspect that is easily overlooked. The emphasis on a dichotomous view, which contrasts outputs of low-level feature models with 'objects' or 'semantic information', and the tendency to conceptualise these as categorically different interpretations, has concealed a fundamental similarity between these explanations. Specifically, comparable to how low-level models deal with simple features, most studies implicitly treat 'objects' or 'semantic information' as image-computable properties. This notion is also the basis for state-of-the-art computer vision models that aim to predict human fixations (e.g., Kroner et al.).

In Experiment 1, observers viewed black and white two-tone images while their eye movements were recorded.
Two-tone images are derived from photographs of natural scenes ('templates'). Each two-tone appears as meaningless patches on initial viewing.

Once an observer has acquired relevant prior object-knowledge by viewing the corresponding template, however, processes of perceptual organization in the visual system bind the patches into a coherent object percept. Eye movements in response to two-tone images should therefore be influenced by whether the observer experiences the input as an object percept. Specifically, patterns of fixations on identical two-tone images should be more similar to the ones from the corresponding template when an observer experiences the two-tone image as a meaningful object percept compared to when they experience it as meaningless patches.

To test these predictions, we recorded eye movements of 36 human observers who viewed two-tone images before and after being exposed to the relevant templates (Before, After, and Template conditions, respectively; see Fig. 2). In the Before condition, observers perceive two-tone images as meaningless black and white patches. In the After condition, prior object-knowledge allows them to bind patches into meaningful object percepts. Crucially, any potential differences in eye movements between the Before and the After conditions cannot be explained by image-computable features because these are identical across the two conditions; the only aspect that has changed is the prior object-knowledge.

A detailed description of the stimuli can be found in the Methods. In brief, template images (predominantly photographs of animals in their natural environments) were taken from the Corel Photo library. Two-tones were generated by smoothing and binarising the template images. A good two-tone image should be perceived as a collection of meaningless patches prior to seeing its template, but observers should be able to easily bind the stimulus into a coherent percept of an object after they see the template. Extensive tests on naïve observers were conducted to select both the template images and the parameters of smoothing and binarisation that guarantee that the created two-tones have these desired properties.
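The smoothing-and-binarisation step can be sketched in code. This is a minimal Python illustration, not the study's implementation (which selected smoothing and binarisation parameters per image through testing on naïve observers); the box-blur kernel size and the median threshold are assumptions.

```python
import numpy as np

def box_blur(img, k=9):
    # Separable box blur: an illustrative stand-in for the study's
    # smoothing step (actual method and parameters not specified here).
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode='edge')
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='valid'), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='valid'), 0, out)
    return out

def make_two_tone(template, k=9, threshold=None):
    # Smooth a greyscale template, then binarise it into black/white patches.
    # Thresholding at the median is an assumed rule for illustration.
    smoothed = box_blur(template, k)
    if threshold is None:
        threshold = np.median(smoothed)
    return np.where(smoothed > threshold, 255, 0).astype(np.uint8)
```

Stronger smoothing removes fine detail, which is what makes the resulting two-tone hard to organize without prior knowledge of the template.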
The experiment consisted of ten blocks; a single block is schematically illustrated in Fig. 2. Before the start of the procedure, a 13-point eye-tracker calibration and validation was conducted. Each block started with the Before condition, in which three two-tones were presented in a sequence, each for 3 seconds. Observers were instructed to carefully look at these images. Two-tone images were preceded by a centrally-located fixation-dot displayed for 1 second. They were followed by a visual analogue scale, which observers adjusted by pressing the 'z' and 'm' buttons on a keyboard to indicate how meaningful they experienced the two-tone image to be. Meaningfulness ratings were used as a manipulation check. After each rating, a blank screen was displayed for 500 ms. The Before condition was followed by the Template condition, in which template images were displayed while eye movements were recorded (again, each for 3 seconds, preceded by a fixation dot). After the Template condition, we ensured that observers had enough object-knowledge to bind two-tone images into meaningful object percepts by presenting six cycles of dynamic blending between two-tones and their templates (Blending Phase). Each cycle began with the presentation of a template image for 2 seconds. This was then linearly blended into the corresponding two-tone image, with the full transition from template to two-tone taking 4 seconds. The two-tone image remained on the screen for 2 seconds and was then blended back into the template, which remained on the screen for another 2 seconds. Each of the three image-pairs used in a block was presented in a full blending procedure twice, with the order pseudorandomised such that the same pair was never used twice in a row. (Note: images in the figures are covered because bioRxiv does not allow posting images containing faces; the original figure is available at https://bit.ly/3nYBfa4.)
Subsequent cycles of blending were separated by a blank screen presented for 500 ms. After the Blending Phase, the After condition was presented, which was identical to the Before condition except that images were presented in a newly randomized order. There was a break every two blocks, during which the eye-tracker was re-calibrated. For each observer, images were assigned to blocks randomly and were presented in a pseudo-random order within each block. The pseudo-randomization ensured that the image shown last in the Blending Phase was never presented at the beginning of the After condition. Total experiment time was ~50 minutes.
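The timing of one blending cycle can be sketched as a weighting function over time. This is a Python illustration only; the durations follow the description above, except that the duration of the fade back to the template is an assumption (taken to mirror the 4-second template-to-two-tone transition).

```python
def template_weight(t, hold=2.0, ramp=4.0):
    # Weight of the template at time t (in seconds) within one cycle:
    # template held `hold` s, linear fade to two-tone over `ramp` s,
    # two-tone held `hold` s, fade back over `ramp` s (assumed duration),
    # template held again.
    if t < hold:
        return 1.0
    t -= hold
    if t < ramp:
        return 1.0 - t / ramp
    t -= ramp
    if t < hold:
        return 0.0
    t -= hold
    if t < ramp:
        return t / ramp
    return 1.0

def blend_frame(template, two_tone, t):
    # Linear pixelwise blend of template and two-tone at time t.
    a = template_weight(t)
    return a * template + (1.0 - a) * two_tone
```

At t = 0 the observer sees the pure template, at the midpoint of each ramp an equal mixture, and during the middle hold the pure two-tone.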

Instructions were delivered verbally and on-screen. Key elements of the procedure were illustrated visually: observers were shown a single two-tone image (which was not used in the actual experiment), rated its meaningfulness, viewed the blending procedure with the template and, finally, viewed the same two-tone again and were asked to provide a meaningfulness rating.

This pattern of results indicates that knowledge-driven perceptual organization is an important driver of oculomotor control. However, it is also consistent with three alternative interpretations, which we excluded in Experiments 2 and 3. In Experiment 2, the template was the original grayscale photograph from which the two-tone had been generated but, by contrast to Experiment 1, it was mirror-flipped. This manipulation changed the spatial locations of objects on the screen while leaving the content of the templates intact.

The default EyeLink algorithm was used to extract fixation locations from the eye-movement recordings. Further data pre-processing was done in Matlab. For each image, we discarded the initial fixation that was directed at the fixation-dot presented before image onset. We also discarded fixations not landing within the image boundaries.

[0.01, 0.14]). Importantly, however, this difference was small, suggesting that a centre bias explained most, but not all, of the Template-Before similarity.

Template conditions to compare with the heatmaps of the After condition (Fig. 6).

The results of this similarity analysis suggest that both first and all remaining fixations in the After condition were guided synergistically by image-computable features and object representations (Fig. 7). The linear-combination heatmaps that had the highest similarity to the observed fixations assigned weight to both of these components. The findings for all remaining fixations from the After condition were similar (Fig. 7B).
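The fixation pre-processing described above (done in Matlab in the study) can be sketched in Python. The list-of-(x, y)-tuples data layout is an assumption for illustration, not the EyeLink record format.

```python
def clean_fixations(fixations, width, height):
    # Drop the initial fixation (directed at the pre-stimulus fixation
    # dot) and any fixation landing outside the image boundaries.
    # `fixations` is assumed to be a list of (x, y) pixel coordinates.
    kept = fixations[1:]  # discard the first fixation
    return [(x, y) for (x, y) in kept
            if 0 <= x < width and 0 <= y < height]
```

For example, with a 20 x 20 px image, a first fixation at the dot, one fixation on the image, and two outside it, only the on-image fixation survives.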

However, the linear combinations that were optimal for these fixations were more

If the effects observed in Experiment 1 were attributable to the object-to-location mapping hypothesis, which suggests that observers merely revisited the parts of the display that contained objects during the presentation of template images, we would expect a high similarity between heatmaps from the After and Template conditions, despite the lack of overlap in the spatial locations of objects in these two conditions.
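The linear-combination analysis described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: Pearson correlation as the similarity metric and an evenly spaced grid search over the object-map weight; the study's exact combination rule and metric may differ.

```python
import numpy as np

def similarity(a, b):
    # Pearson correlation between two maps (assumed similarity metric).
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def combine(feature_map, object_map, w):
    # w is the weight given to the object map; both maps are
    # normalised to sum to 1 before mixing.
    f = feature_map / feature_map.sum()
    o = object_map / object_map.sum()
    return (1.0 - w) * f + w * o

def best_weight(feature_map, object_map, fixation_map, weights=None):
    # Grid-search the object-map weight that maximises similarity
    # between the combined map and the observed fixation heatmap.
    if weights is None:
        weights = np.linspace(0.0, 1.0, 21)
    sims = [similarity(combine(feature_map, object_map, w), fixation_map)
            for w in weights]
    return float(weights[int(np.argmax(sims))])
```

A best weight strictly between 0 and 1 would indicate the synergistic guidance by features and object representations reported above.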

If, however, the effects observed in Experiment 1 were attributable to knowledge-dependent object representations, we would expect the similarity between the After and Template conditions to be low (see Fig. 3 for illustration). Moreover, by mirror-flipping the heatmaps obtained from the mirror-flipped templates, we would expect an increase in similarity to the levels seen in Experiment 1 (because this leads to a re-alignment of the heatmaps from templates and two-tones).

The analysis of meaningfulness ratings indicates that observers were able to bind the two-tone images into meaningful percepts despite the templates being presented in a mirror-flipped manner.
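The re-alignment step can be sketched in code: flipping the heatmap from a mirror-flipped template back restores its spatial correspondence with the two-tone heatmap before the similarity is computed. Pearson correlation is an assumed similarity metric for illustration.

```python
import numpy as np

def heatmap_similarity(a, b):
    # Pearson correlation between two fixation heatmaps
    # (assumed metric; the study's exact measure may differ).
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def realigned_similarity(two_tone_map, flipped_template_map):
    # Mirror-flip the heatmap obtained on a mirror-flipped template
    # so that it is spatially re-aligned with the two-tone heatmap.
    return heatmap_similarity(two_tone_map, np.fliplr(flipped_template_map))
```

Under the object-representation account, the raw similarity should be low and the re-aligned similarity should rise toward the levels observed in Experiment 1.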

The results of the eye-movement data analysis were inconsistent with the object-to-location hypothesis but provided support for the idea that knowledge-dependent object representations guide gaze.

In the first block, the same dummy templates (greyscale images not related to any of the two-tones) were always presented. In all other blocks, the assignment of stimuli to experimental blocks was pseudo-randomized for each observer individually in a way that guaranteed that the dummy templates presented in any given block were the real templates of the two-tones presented in the preceding block (see Fig. 11). To ensure that we included data from the same number of observers for each two-tone and template, we had to discard fixations registered on the two-tones presented in the final experimental block and fixations on the dummy templates from the first block ('initial templates'). Note that, because we pseudo-randomized the order of stimulus presentation for each observer individually, for different images we had to discard data from different observers.
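The block-assignment constraint just described can be sketched in code. This is a Python illustration with assumed data structures (image identifiers, a fixed number of blocks, three two-tones per block); the study's implementation is not specified here.

```python
import random

def assign_blocks(image_ids, n_blocks, seed=None):
    # Pseudo-randomized stimulus-to-block assignment: the dummy
    # templates shown in block i are the real templates of the
    # two-tones shown in block i-1; block 0 instead shows unrelated
    # 'initial' templates (represented here as None).
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    per_block = len(ids) // n_blocks
    blocks = []
    for i in range(n_blocks):
        two_tones = ids[i * per_block:(i + 1) * per_block]
        dummies = None if i == 0 else blocks[i - 1]['two_tones']
        blocks.append({'two_tones': two_tones, 'dummy_templates': dummies})
    return blocks
```

Because each observer receives an independent shuffle, the images whose data must be discarded (last-block two-tones, first-block dummies) differ across observers, as noted above.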

Importantly, however, for each image set consisting of a two-tone (viewed in the Before and After conditions), its dummy template, and its real template, we retained data from a homogenous group of 18 observers (out of the 20 who completed the experiment); the composition of these groups differed between image sets.

The analysis of meaningfulness ratings demonstrated that, as expected, observers were not able to bind the two-tone images into coherent object percepts even in the After condition (Fig. 12A and B). In particular, the differences in ratings between the Before and After conditions were not statistically significant, both when the data were averaged per observer and when they were averaged per image.

The findings of Experiment 1 cannot be attributed to order effects

In the final analysis, we considered the possibility that order effects explain the key findings of Experiments 1 and 2. Specifically, we asked whether viewing the same two-tones for a second time without receiving prior object-knowledge could change fixation patterns such that they would resemble the patterns from the (real) templates. Recall that the design of Experiment 3 ensured that observers saw each two-tone image twice, each time without prior object-knowledge (Before and After conditions, respectively), and that they also saw the real template for these two-tones in the following block. If the findings in Experiments 1 and 2 resulted, at least partly, from an order effect, we would expect the similarity in fixation patterns in the (real) Template-After pair to be higher than in the (real) Template-Before pair in the current experiment.

The results were inconsistent with this 'second-viewing' hypothesis (Fig. 12D).

When an observer explores the environment with no specific task other than to obtain information, control of eye movements is typically considered within a dichotomy between low-level features and high-level object representations. Here, we abandoned this simplifying framework in light of emerging evidence highlighting the complex and intricate relationship between image-computable features and high-level object representations in visual perception. We recorded eye movements in response to two-tone images, stimuli that appear as meaningless patches on initial viewing but, once relevant object-knowledge has been acquired, are organized into coherent and meaningful percepts of objects. In the current study, prior object-knowledge was provided in the form of template images, i.e., the unambiguous photographs from which the two-tone images had been generated. Across three experiments, fixation patterns on the same two-tone images differed substantially depending on whether observers experienced them as meaningless patches or organized them into object representations. In particular, we found that, when organized into object representations, fixation patterns on two-tone images were more similar to those on templates, more focused on object-specific, pre-

The idea that knowledge-driven object representations restructure human eye movements is supported both by our general assessment of fixation distributions between two-tone images and templates, and by a more specific analysis focusing on

Equally important as the finding that knowledge-driven object representations guide human gaze is the fact that they do not fully determine the selection of fixation locations.

While eye movements on two-tone images changed once they elicited object representations, such that fixation distributions became more similar to fixations on