한국어 고급 학습자의 다중양식 읽기 전략: 텍스트, 이미지, 오디오 통합에서의 시선추적 연구
Multimodal Reading Strategies in Native and Advanced L2 Korean Speakers: An Eye-Tracking Study of Text, Image, and Auditory Integration
Abstract
배경 및 목적
본 연구는 들으면서 읽기(Reading-while-listening, RWL) 조건에서 청각 입력이 한국어 모어 화자(L1)와 고급 한국어 학습자(L2)의 다중양식 텍스트 이해에 미치는 효과를 살펴보고, 이때 나타나는 읽기 전략과 작업 기억(working memory, WM)의 상호작용을 규명하는 데 목적이 있다.
방법
28명의 참여자(한국어 모어 화자 14명, 고급 학습자 14명)를 대상으로 시선 추적 기법을 사용하여 글과 그림, 그리고 오디오로 구성된 다중양식 텍스트를 읽는 동안의 안구 운동 패턴을 시선 고정 시간과 고정 횟수, 이동 횟수 등을 통해 분석하였다. 또한, WM 용량을 측정하고, 특정 읽기 전략 채택에 미치는 영향을 살펴보았다.
결과
분석 결과, L1과 L2 모두 시각 자료보다는 문자 정보에 더 많은 주의를 기울였다. 특히 L2 학습자의 경우 고정 횟수와 시간이 유의미하게 더 높아 인지 부하가 큰 것으로 나타났다. 전략 사용에 있어서는 L2 학습자는 주로 국지적 전략을 사용한 반면, L1 학습자는 통합적 전략(content integration reading)을 더 많이 사용하였다. WM 분석 결과, L1 학습자의 경우 WM이 높을수록 핵심어 기반 시선 전환 전략(keyword-triggered switching)을 덜 사용하고 통합적 전략을 더 많이 사용하는 경향이 확인되었으나, L2 학습자에게서는 이러한 효과가 뚜렷하게 드러나지 않았다.
논의 및 결론
본 연구결과는 RWL 상황에서 청각 입력이 L2 학습자의 이해를 보조할 잠재력이 있으나, 여전히 인지 부하가 남아 있어 전략적 지원이 필요함을 시사한다. 특히 L2 학습자의 경우, WM 자원이 상대적으로 제한되거나 언어적 자동화가 충분히 이루어지지 않아 그림·텍스트 정보를 효율적으로 통합하지 못할 수 있으므로, 교수학습 현장에서 어휘 난이도 조정, 시각 자료 설계, 전략 교육 등을 통한 체계적 개입이 요구된다.
Trans Abstract
Objectives
This study investigates the impact of auditory input, specifically within the Reading While Listening (RWL) framework, on the comprehension of multimodal texts among native Korean speakers (L1) and advanced Korean learners (L2). Additionally, it explores differential reading strategies and the influence of working memory (WM) capacity on the processing of multimodal content.
Methods
Eye-tracking technology was utilized to examine the eye movement patterns of 28 participants (14 L1, 14 L2) as they engaged with multimodal texts consisting of written, visual, and auditory components. Key metrics—including dwell time, fixation count, and saccadic movements—were analyzed to gain insights into the cognitive processing behaviors of both groups. Furthermore, WM capacity was assessed to determine its effect on the adoption of specific reading strategies.
Results
The findings revealed that both L1 and L2 learners predominantly focused on textual rather than visual information, with L2 participants exhibiting significantly higher fixation counts and longer fixation durations, indicative of elevated cognitive load. In terms of strategy use, L2 learners relied more on localized, compensatory strategies (e.g., keyword-triggered switching), whereas L1 participants more frequently employed integrative strategies such as content integration reading. With respect to WM, L1 readers with higher WM capacity tended to use keyword-triggered switching less and integrative strategies more, whereas no comparable WM effect emerged among L2 learners.
Conclusion
The results suggest that while auditory input in RWL contexts can facilitate multimodal comprehension, it does not fully alleviate the heightened cognitive demands observed among L2 learners. Instructional approaches aiming to address these challenges may benefit from incorporating explicit strategy training and considering learners’ WM capacity to optimize multimodal reading outcomes.
Comprehending written text is fundamental in both first (L1) and second (L2) language acquisition, serving as the foundation of effective communication and sustained language development. As educational materials increasingly incorporate multimodal elements—combining text, images, and auditory input—the need to understand how learners process these diverse sources becomes more pressing. Research has shown that multimodal texts, which integrate visual (e.g., illustrations, graphs) and auditory stimuli with written content, can engage multiple cognitive processes simultaneously and thus enhance comprehension (Mayer, 2003). This effect is especially pronounced for L2 learners, who often depend on multiple channels to support their understanding and compensate for linguistic limitations (Chang, 2011; Chang & Choi, 2014).
Pellicer-Sánchez, Conklin, Rodgers, and Parente (2021) demonstrated that L2 learners allocate more attention to images under reading-while-listening (RWL) conditions compared to text-only scenarios, highlighting the compensatory role of visual aids when addressing linguistic challenges. Similarly, Holsanova (2020) emphasized that complex multimodal texts require learners to integrate linguistic and visual elements dynamically, which can increase cognitive demands but also foster deeper comprehension if managed effectively. These findings align with Paivio’s dual coding theory (1986), which posits that integrating verbal and nonverbal information facilitates comprehension.
Despite the recognized benefits of multimodal texts, several gaps remain. In particular, a deeper understanding is required of how individual differences—such as working memory (WM) capacity—influence learners’ interactions with multimodal input. These differences affect how learners manage cognitive load and apply reading strategies, ultimately shaping comprehension outcomes (Lim & Kim, 2024). Addressing this complexity involves examining the cognitive mechanisms that underlie multimodal processing, especially for L2 learners who may struggle when relying solely on textual information.
Eye-tracking methodologies have proven instrumental in investigating how learners allocate attention in multimodal contexts (Warren, Boers, Grimshaw, & Siyanova-Chanturia, 2018). Studies have indicated that when images accompany text, L2 learners may shift attention from print to visuals (Tragant & Pellicer-Sánchez, 2019; Warren et al., 2018), and that dynamic images can further influence these patterns. Van der Sluis, Eppinga, and Redeker (2017) observed that the spatial arrangement of text and images in multimodal instructions significantly affects attention distribution and comprehension, underscoring the importance of design elements. While L1 learners often focus primarily on textual elements, L2 learners tend to engage more extensively with visuals as textual complexity increases (Zhao, Schnotz, Wagner, & Gaschler, 2014). Such greater visual exploration can reduce the cognitive load associated with unfamiliar language, once again aligning with Paivio’s dual coding theory (1986).
Recent findings suggest that auditory input may further enhance textual-visual integration (Pellicer-Sánchez et al., 2021). In addition, Brosseuk and Downes (2024) argued that fostering metalinguistic awareness in learners, particularly in multimodal literacy contexts, equips them to better navigate and synthesize diverse input sources. This perspective reinforces the need for explicit exploration of how learners integrate multimodal content effectively.
Working memory plays a key role in managing the cognitive demands of multimodal input (Daneman & Carpenter, 1980; Seong, Lee, & Choi, 2020; Yoo, 2009). Lim and Kim (2024) found that L1 readers benefit more from certain WM components than advanced L2 readers, suggesting that L2 learners might rely on external scaffolds—such as images—rather than internal WM resources. Previous studies have also associated prolonged text fixation with comprehension difficulties, while greater attention to images can improve understanding (Pellicer-Sánchez et al., 2020; Serrano & Pellicer-Sánchez, 2022). This contrast underscores the importance of identifying strategies that support effective multimodal integration, particularly for L2 learners.
RWL offers one avenue for reducing intrinsic cognitive load by aligning auditory cues with visual attention, potentially assisting L2 learners who encounter difficulties with text alone (Conklin, Pellicer-Sánchez, & Carrol, 2018). Multimedia learning (Mayer, 2009) and cognitive load theories (Sweller, 1994) suggest that distributing information across multiple modalities can optimize learning by balancing cognitive resources. However, while research has examined text-and-image conditions (Pellicer-Sánchez et al., 2020; Zhao et al., 2014), fewer studies have addressed how auditory input interacts with textual and visual elements. Although Kim, Yoon, Nam, Lee, and Yim (2023) have shown that audio-assisted reading (AR) can facilitate reading comprehension among younger L1 learners, it remains unclear whether this advantage extends similarly to advanced L2 populations facing additional lexical and syntactic challenges.
Rather than directly comparing conditions with and without auditory support, the present study considers the RWL environment as a distinct and meaningful context that more closely resembles authentic educational scenarios. Holsanova (2020) noted that multimodal scientific texts require iterative integration of diverse sources, aligning with Mayer’s (2009) redundancy principle. Yet this principle may function differently for advanced L2 learners, who may derive different benefits from additional modalities.
Against this backdrop, the present study pursues the following objectives: (1) to investigate how L1 and advanced L2 learners allocate attention to text and visual elements in multimodal RWL conditions; (2) to identify the reading strategies employed to manage the cognitive demands of multimodal integration and discern how these strategies differ between L1 and L2 groups; and (3) to examine how WM capacity influences reading strategies in RWL contexts.
Accordingly, the research questions guiding this inquiry are as follows:
1. How do L1 and advanced L2 learners allocate attention to text and visual elements under RWL conditions?
2. What reading strategies do L1 and L2 learners employ in multimodal integration, and in what ways do these strategies differ between groups?
3. How does WM capacity shape the reading strategies used by L1 and L2 learners in RWL contexts?
By examining these questions through eye-tracking measures, the study aims to clarify the relationship between WM, multimodal integration, and reading strategies in L1 and L2 populations. Understanding these dynamics may inform instructional design, guiding the development of more effective multimodal resources that accommodate learners’ cognitive capacities and linguistic backgrounds.
METHODS
Participants
Initially, thirty-four participants were recruited for the study. However, data from six participants were excluded due to poor calibration and track loss, resulting in less than 80% valid data during the testing process.1 Consequently, a total of 28 participants were included in the final analysis, as displayed in Table 1. This sample comprised 14 L2 speakers (11 female, 3 male) and 14 L1 speakers (11 female, 3 male) of Korean. All L2 participants were enrolled in universities or graduate programs located in the Seoul metropolitan area and held a TOPIK Level 6 certificate, indicating advanced Korean proficiency.
The L2 learners came from diverse national backgrounds: China (10), Japan (2), Russia (2), and Kazakhstan (1). The mean age of the Korean language learners was 21.9 years (SD=3.95), whereas the mean age of the native Korean speakers was 25.8 years (SD=2.06). The L2 speakers were advanced learners of Korean, enrolled in a higher education institution, and had studied Korean for an average of 73.3 months (SD=24.84). They had also resided in Korea for an average of 39.3 months (SD=20).
Participants were recruited by posting a study announcement, which included the study’s purpose and procedures, on social media and university community boards. During recruitment, potential participants were informed that individuals wearing glasses would be excluded due to the possibility of decreased eye-tracking accuracy caused by light reflections. Interested participants were invited to the laboratory, where they received a detailed explanation of the study procedures and provided informed consent before beginning the experiment. Upon completing the experiment, participants were compensated with 15,000 KRW in cash.
Reading Materials
To examine how learners process multimodal texts under RWL conditions, argumentative, narrative, and expository texts were selected. These genres represent common reading materials for learners and thus allow an examination of diverse reading strategies. The texts were sourced from intermediate-level Korean language and middle school textbooks to ensure alignment with the target proficiency levels.2
To validate the texts’ difficulty and suitability, three experienced Korean language instructors (each with over ten years of teaching experience) and five intermediate-level learners reviewed the materials. They provided feedback on text difficulty, length, appropriateness of visual aids, and the alignment of comprehension questions with textual content. Based on their input, the texts were revised and finalized to match the learners’ proficiency levels and cognitive load requirements.
A standardized format was employed: titles were placed at the top-left, written text below the title on the left, and visuals on the right. Since effective multimodal comprehension depends on integrating written text and visual materials (Yoo, 2009), maintaining consistency in layout was essential. Previous research by Levie and Lentz (1982) found that incongruent images hinder learning, while Mayer (2009) noted that congruency between textual and visual elements enhances learning efficiency. Following these principles, the materials in this study ensured a high degree of alignment between textual and visual content.
To minimize order effects, the texts were presented in a randomized sequence, ensuring that each participant encountered a different order of text genres. For narrative texts, illustrations that effectively depicted narrative flow were chosen. Expository texts used composite materials that combined text and images to facilitate understanding, and argumentative texts included graphs representing arguments and data. Auditory narration was recorded by a female native Korean speaker using standard Korean. On average, audio playback durations were 97.5 seconds (SD=5) for argumentative texts, 85 seconds (SD=6) for expository texts, and 97.5 seconds (SD=9.5) for narrative texts. The average audio duration per word varied slightly by text type: approximately .75 seconds per word for argumentative texts, .65 seconds per word for expository texts, and .78 seconds per word for narrative texts. Appendix 1 provides examples of the reading materials.
Comprehension Test
After reading each text, a comprehension check consisting of one factual and one inferential question was administered to measure participants’ understanding. Factual questions required recalling explicitly stated information, while inferential questions demanded integrating contextual clues with prior knowledge. Both L1 and L2 participants demonstrated high accuracy (L1: 99.40%, SD=2.23%; L2: 96.3%, SD=8.20%), suggesting that they could perform the reading tasks adequately.
An independent t-test comparing accuracy rates between L1 and L2 groups showed no statistically significant difference, t=1.54, p=.140, with a mean difference of 3.11%. These results indicate that the materials were comprehensible to both groups. Appendix 2 presents examples of the comprehension questions.
Working Memory Assessment Task
A working memory measurement tool developed by Baik (2014) was employed. This validated phonological WM measure minimized lexical complexity by using frequently encountered one- to three-syllable Korean words, ensuring that the task primarily reflected WM capacity rather than vocabulary knowledge.3
Participants performed simple arithmetic (e.g., if ‘(4/2) + 2=6’ is correct, press ‘y’; if incorrect, press ‘n’) while listening to recorded Korean words, then verbally recalled all the words heard.4 Their spoken recall was recorded and scored by two independent evaluators, assigning ‘1’ for correct and ‘0’ for incorrect responses. In cases of unclear pronunciation, a partial score of .5 was awarded, acknowledging minor articulation issues without fully penalizing participants.5
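As a minimal sketch of the scoring rule described above (the function name and judgment labels are hypothetical, not part of the published protocol), the per-word recall credit could be tallied as follows:

```python
def score_recall(judgments):
    """Sum per-word recall credit: 1 for a correctly recalled word,
    0 for an incorrect one, and 0.5 partial credit when pronunciation
    was unclear, matching the study's scoring rule."""
    credit = {"correct": 1.0, "incorrect": 0.0, "unclear": 0.5}
    return sum(credit[j] for j in judgments)

# Example trial: three correct recalls, one unclear, one miss
print(score_recall(["correct", "correct", "unclear", "correct", "incorrect"]))  # 3.5
```

In practice each participant's recording was scored by the two evaluators independently, so a sketch like this would be applied once per rater before resolving disagreements.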
Visual WM was assessed using a grid pattern memorization task, where participants shaded a blank grid to reproduce a previously shown pattern. Do and Lee (2006) noted that WM usage varies by text type, emphasizing the importance of measuring both phonological and visual WM. This study, unlike previous ones focusing solely on phonological WM, assessed both subcomponents. Accuracy in the visual WM task was recorded using E-Prime software.
Eye-Tracking
This study utilized an eye tracker to analyze participants’ visual behavior while they engaged with multimodal reading materials. An eye tracker is a research tool that measures the movement of participants’ pupils and corneal reflections through infrared projection, allowing researchers to determine the specific areas where participants focus by tracking their eye movements (Shin & Shin, 2012). The Gazepoint system from iMotions, specifically the Gazepoint-GP3, was used in this study. This device records data at a frequency of 150 Hz, meaning it measures pupil movements 150 times per second, which enables highly detailed tracking of eye movement.6 The accuracy of the eye tracker ranged from 0.5° to 1°, with the average deviation in this study falling within these values, ensuring precise gaze data collection.
The system operated using a screen-based method, in which an infrared camera mounted on a desk analyzed eye movements without any direct contact, utilizing the infrared light reflected from participants’ eyes. The eye tracker was positioned at the bottom of the monitor, allowing participants to engage in the reading tasks in a natural and unrestricted setting. Participants sat in front of a computer monitor and used a mouse, which replicated a typical computer usage environment. Unlike more restrictive eye-tracking devices, this system allowed for relatively greater freedom of head movement, which contributed to a more immersive and realistic experimental setting.
Calibration was conducted for each participant to ensure accuracy in tracking. This procedure involved adjusting the eye tracker based on each participant’s eye shape, light reflection, and movement range. The experiment commenced only once all nine calibration points were successfully registered. To maintain precision, calibration was repeated whenever necessary throughout the experiment.
Several key eye movement metrics were examined to provide a detailed analysis of the reading process: dwell time, fixation count, average fixation duration, and integrative saccades. Each of these metrics provides valuable insights into participants’ visual attention and the cognitive processes involved in reading multimodal texts.
Dwell time refers to the cumulative time (in milliseconds) during which a participant’s gaze remains fixed on a specific Area of Interest (AOI) (iMotions, 2022). In this study, AOIs were defined as text and image regions to examine how participants allocated their attention across different components of the reading material. Dwell time provides a measure of attentional focus and helps quantify the cognitive load associated with particular areas; longer dwell times typically indicate increased cognitive effort and processing demands (Cook & Wei, 2019; Holsanova, 2014).
Although dwell time was initially recorded in milliseconds, this study presents dwell time values as percentages. Because the total reading period for each participant was standardized by the fixed duration of the accompanying audio file, expressing dwell time as a proportion of the entire reading interval allowed for more meaningful comparisons between participants. This normalization approach is consistent with methods employed in previous multimodal research (Pellicer-Sánchez et al., 2020), ensuring that variations in absolute reading times did not confound the analysis of attentional allocation.
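For illustration, this normalization amounts to a simple proportion of the audio-defined trial length. The sketch below uses hypothetical numbers, with the 97.5-second argumentative-text duration reported earlier as the trial length:

```python
def dwell_time_percent(dwell_ms, trial_ms):
    """Express cumulative dwell time on an AOI as a percentage of
    the fixed, audio-defined reading period."""
    return 100.0 * dwell_ms / trial_ms

# e.g., a hypothetical 72,800 ms on the text AOI within a 97,500 ms trial
print(round(dwell_time_percent(72_800, 97_500), 2))  # 74.67
```

Because the denominator is constant across participants for a given text, the resulting percentages are directly comparable between readers and conditions.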
Fixation count was also analyzed, representing the number of times a participant fixated on a particular AOI. A fixation occurs when a participant’s gaze remains steadily focused within an AOI for a sustained period; minimum duration thresholds reported in the literature typically range from roughly 40 to 200 milliseconds (Holmqvist et al., 2011). In this study, the minimum threshold for a fixation was set at 80 milliseconds. The fixation count reflects the level of attention and the effort required to extract information from a specific area. A higher fixation count can suggest that the content was either challenging or particularly salient.
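As a minimal sketch of how the 80 ms criterion filters the per-AOI counts (the `(label, duration)` pair format is a stand-in for the software's actual export, not its real schema):

```python
MIN_FIXATION_MS = 80  # minimum fixation duration used in this study

def fixation_count(events, aoi):
    """Count fixation events on one AOI, keeping only events that
    meet the minimum-duration criterion."""
    return sum(1 for label, dur in events
               if label == aoi and dur >= MIN_FIXATION_MS)

# Hypothetical event stream: the 60 ms event falls below threshold
events = [("text", 120), ("text", 60), ("image", 95), ("text", 200)]
print(fixation_count(events, "text"))  # 2
```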
The average fixation duration was then calculated by dividing the total fixation time by the number of fixations within a given AOI (iMotions, 2022). This metric indicates the average time spent per fixation, providing insights into the depth of cognitive processing. While dwell time represents the overall cumulative gaze time on an AOI, average fixation duration focuses on the mean length of each individual fixation event. Longer fixation durations often signify that participants required additional time to interpret the visual information, which may indicate either higher content difficulty or a need for more in-depth comprehension (Liu, 2021; Rayner, 1998).
Lastly, integrative saccades—rapid eye movements made exclusively from a text area to an image area or from an image area back to a text area—were examined, thereby excluding any saccades occurring solely within text regions or solely within image regions. This metric is particularly relevant for understanding how participants integrate different types of information while reading multimodal texts. Frequent integrative saccades in this study indicate active attempts to link textual content with visual elements, reflecting a comprehensive approach to processing the presented material.
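The exclusion rule for within-region movements can be sketched as a simple scan over the sequence of fixated AOI types (a hypothetical representation of the gaze record, not the software's output format):

```python
def integrative_saccade_count(aoi_sequence):
    """Count gaze shifts that cross between text and image AOIs;
    consecutive fixations within the same AOI type are ignored,
    matching the definition of integrative saccades."""
    return sum(1 for prev, curr in zip(aoi_sequence, aoi_sequence[1:])
               if {prev, curr} == {"text", "image"})

# Hypothetical fixation sequence with two text-image crossings
seq = ["text", "text", "image", "image", "text", "text"]
print(integrative_saccade_count(seq))  # 2
```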
Reading Strategy Identification
The four reading strategies identified in this study—contextual overview reading, synchronized tracking, keyword-triggered switching, and content integration reading—reflect patterns observed in earlier eye-tracking and multimedia learning research. Rather than representing novel constructs, these strategies synthesize findings from studies examining how learners process text and images, integrate auditory input, and navigate lexical challenges.
Contextual overview reading, involving an initial scanning of textual and visual elements to form a preliminary understanding, showed no significant group-level differences in usage rates. Previous studies (Serrano & Pellicer-Sánchez, 2022; Zhao et al., 2014) indicate that learners, regardless of language proficiency, often begin by establishing a global framework before focusing on details. Holsanova (2020) further emphasizes that, in multimodal scientific texts, initial global scanning helps connect linguistic and visual components and activate relevant schemas, facilitating subsequent integrative processing. This foundational strategy may be less sensitive to proficiency differences since it involves broad orientation rather than language-dependent decoding.
Synchronized tracking, informed by research into RWL contexts, involves aligning eye movements with concurrent auditory input. Conklin, Alotaibi, Pellicer-Sánchez, and Vilkaitė-Lozdienė (2020) and Pellicer-Sánchez et al. (2021) observed that auditory cues can guide visual attention, resulting in a synchronization of visual and auditory modalities. Pellicer-Sánchez et al. (2018) demonstrated that learners can effectively synchronize text and visual elements in parallel with auditory cues, optimizing cognitive load management. While this alignment does not guarantee uniform comprehension gains, its prominence suggests a role in coordinating multimodal cues for both L1 and L2 learners.
Keyword-triggered switching characterizes shifts in visual attention to images or glosses upon encountering unfamiliar or critical lexical items. Past research (Serrano & Pellicer-Sánchez, 2022; Warren et al., 2018) found that L2 learners frequently employ this compensatory strategy to resolve lexical difficulties. Van der Sluis et al. (2017) similarly reported that strategically placed images help learners navigate textual comprehension challenges, particularly when dealing with high lexical complexity. Keyword-triggered switching thus represents a bottom-up approach that leverages visual elements to alleviate cognitive load associated with challenging vocabulary.
Content integration reading, involving revisits to images after processing textual segments, aligns with theories of iterative integration such as Kintsch’s Construction-Integration Model (1998). Pellicer-Sánchez et al. (2021) noted that readers return to images to confirm or enrich mental representations formed from text. Holsanova (2020) likewise underscores that iterative engagement with multimodal components refines mental representations, especially in scientific texts. Brosseuk and Downes (2024) highlight metalinguistic awareness as a key factor facilitating such integrative strategies in educational settings. This approach suggests that more proficient readers can fluidly integrate multiple information sources, constructing coherent mental models with relative ease.
Data Analysis Method
Building on the reading strategies defined in Section 2.6, all eye-tracking data were processed using iMotions software, and the extracted metrics (e.g., fixation duration, dwell time, saccade counts) were subjected to statistical analysis to examine group differences and relationships with cognitive factors. Data with less than 80% validity were excluded to maintain reliability. After filtering, gaze plots were analyzed to identify reading patterns, followed by the definition of Areas of Interest (AOIs) for text and image regions. Metrics such as fixation duration and integrative saccades were extracted, and appropriate statistical tests were conducted to assess differences between groups (L1 vs. L2) and conditions (text vs. picture).
In addition to these analyses, the four reading strategies (contextual overview reading, synchronized tracking, keyword-triggered switching, and content integration reading) identified from the eye-tracking data were further examined. Each participant’s gaze behavior was coded dichotomously (1=strategy used, 0=strategy not used), translating qualitative patterns into quantitative variables suitable for statistical testing. Chi-square tests determined whether strategy usage frequencies differed significantly between groups or conditions. Logistic regression models identified which learner characteristics (e.g., working memory capacity, proficiency level) predicted the likelihood of employing particular strategies. Correlation analyses explored relationships between strategy usage rates and continuous eye-tracking measures (e.g., fixation duration, integrative saccade counts).
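To illustrate how the dichotomous coding feeds a chi-square comparison, the sketch below computes the Pearson statistic for a 2x2 table by hand. The counts are toy numbers chosen for illustration only, not the study's data:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a
    2x2 contingency table [[a, b], [c, d]], e.g. rows = L1/L2 group
    and columns = strategy used / not used."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Toy counts: 10 of 14 hypothetical L1 participants vs. 5 of 14 L2
# participants coded as using a given strategy
print(round(chi_square_2x2(10, 4, 5, 9), 2))  # 3.59
```

The resulting statistic would then be compared against the chi-square distribution with one degree of freedom; in practice a statistics package would also return the p-value directly.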
By integrating gaze-based metrics, AOI-specific analyses, and binary-coded strategy usage, a rigorous and comprehensive interpretation of complex multimodal reading behaviors was achieved. This combination permitted a nuanced examination of how individual differences and cognitive factors influenced participants’ adoption and effectiveness of specific reading strategies under narrated multimodal conditions.
RESULTS
Differences in Eye Movements Between Groups in Text and Visual Materials
Building on the methodological framework and reading strategies discussed in the preceding sections, this section presents the eye-tracking results. As outlined previously, the main metrics (dwell time, fixation count, average fixation duration, and integrative saccades) were analyzed to understand how L1 and L2 learners allocated visual attention across text and image Areas of Interest (AOIs), as well as how these patterns related to identified reading strategies.
To assess group differences, median, mean, and standard deviation values for each metric were calculated, and the Mann-Whitney U-test was employed to evaluate differences between AOIs (text vs. image) and participant groups (L1 vs. L2). Figure 1 shows the average dwell time percentage by AOI type and group, while Figure 2 illustrates the average fixation count by AOI type and group. Figure 3 presents the average fixation duration by AOI type and group, and Figure 4 compares the number of integrative saccades between L1 and L2 groups. These figures complement the statistical results summarized in the corresponding tables.
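The Mann-Whitney U statistic used throughout these comparisons can be sketched in a few lines of pure Python via direct pairwise comparison. The sample values are illustrative only, not the study's data, and no p-value is computed here:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent samples, computed
    by pairwise comparison (ties contribute 0.5 each); the smaller of
    the two possible U values is conventionally reported."""
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    return min(u, len(x) * len(y) - u)

# Toy dwell-time percentages for two hypothetical groups
l1 = [74, 76, 80, 70]
l2 = [79, 82, 85, 81]
print(mann_whitney_u(l1, l2))  # 1.0
```

Because the test relies on ranks rather than raw values, it is appropriate for the skewed, non-normal distributions that eye-tracking metrics often exhibit.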
Comparison of dwell time (%) between groups in text and visual areas
As shown in Table 2, the L1 group’s mean dwell time for text was 74.71% (SD=10.73), while for pictures it was significantly lower at 15.18% (SD=8.80). The Mann-Whitney U-test indicated that this difference was statistically significant (p<.001), suggesting that native speakers focused significantly more attention on text areas. Figure 1 reflects this trend, with notably higher dwell time percentages for text AOIs among L1 learners.
A similar pattern was observed in the L2 group, where the mean dwell time for text was 79.22% (SD=10.18), compared to 11.64% (SD=9.61) for pictures. The test again revealed a significant difference (p<.001), indicating that both groups devoted more time to text than to pictures. Figure 1 similarly shows elevated dwell time percentages for text AOIs in the L2 group, consistent with the statistical findings.
When comparing the two groups, L2 learners devoted a significantly greater proportion of their dwell time to text areas than L1 participants (U=358, z=-2.045, p=.04), consistent with the group means reported above. In contrast, there was no statistically significant difference in dwell time for picture areas between the groups (U=367, z=-1.92, p=.05), although the p-value sat at the conventional threshold, suggesting a possible trend. In summary, Table 2 clearly indicates that both L1 and L2 participants primarily focused on text, with L2 learners allocating an even larger share of their attention to text than L1 participants.
Comparison of fixation count between groups in text and visual areas
As presented in Table 3, the L1 group’s mean fixation count for text was 71.70 (SD=29.37), significantly higher than for pictures (13.97, SD=8.24), with the Mann-Whitney U-test showing p<.001. The L2 group showed a similar trend, with a mean fixation count of 125.24 (SD=43.85) for text and 18.88 (SD=17.83) for pictures (p<.001). These results indicate that both groups concentrated more fixations on text AOIs than on visuals, a pattern also visible in Figure 2, where text AOIs register markedly higher fixation counts.
When comparing groups, native speakers had significantly fewer fixations in the text area than L2 learners (U=136.50, z=-5.03, p<.001). No significant difference was found for picture areas (U=489.50, z=-.28, p=.783). This may imply that L2 learners required more frequent fixations on text, potentially reflecting greater cognitive processing demands. Figure 2, in conjunction with Table 3, confirms that while both groups focus on text, L2 learners tend to fixate more often, suggesting higher processing effort.
Comparison of average fixation duration (ms) between groups in text and visual areas
Turning to Table 4, the L1 group’s average fixation duration was 98.22 ms (SD=8.96) for text and 98.84 ms (SD=19.05) for pictures, with no significant difference (p>.05). The L2 group similarly showed comparable durations for text (M=109.04 ms, SD=11.81) and pictures (M=107.71 ms, SD=24.18) (p>.05), suggesting similar visual processing times across AOIs.
However, between groups, native speakers had significantly shorter fixation durations for both text (U=205.00, z=-4.103, p<.001) and pictures (U=343.00, z=-2.247, p=.025) compared to L2 learners. These results, reflected in Figure 3, imply that L1 learners processed information more efficiently, while L2 learners required longer fixations, indicating more effortful processing.
Comparison of integrative saccade count between groups in text and visual areas
As shown in Table 5, the average integrative saccade count between text and picture areas for the L1 group was 5.60 (SD=3.40), while for the L2 group it was 6.82 (SD=5.96). The Mann-Whitney U-test indicated no statistically significant difference between the groups (p=.968), suggesting that both groups exhibited similar eye movement patterns when integrating information across text and visual elements. This implies comparable processing of multimodal information between native speakers and advanced learners. However, as shown in Figure 4, within the L1 group, the variability in integrative saccades was relatively small, indicating consistent eye movement behavior among native speakers. In contrast, the L2 group showed considerable variability in saccade counts between text and picture areas, suggesting substantial differences among individual learners in how they integrated information across modalities.
Multimodal Text Reading Strategies of L1 and L2 Learner Groups
Building on the reading strategies defined in Section 2.6 and following the operational definitions and methodology introduced earlier, this section reports the frequency of strategy use for both L1 and L2 learners. The four strategies—contextual overview reading, synchronized tracking, keyword-triggered switching, and content integration reading—were examined to determine whether significant differences existed between learner groups. Each participant’s gaze behavior was coded dichotomously (1=strategy used, 0=strategy not used), and the frequency of each strategy was expressed as a proportion of the total number of strategies employed by each group. Figure 5 provides a visual comparison of strategy usage frequencies between L1 and L2 learners, allowing for straightforward cross-group and cross-strategy examination.
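The proportions reported below follow directly from the dichotomous coding scheme: each strategy occurrence is tallied and divided by the group's total strategy count. A minimal sketch of this computation, on invented codings for demonstration only, might look like:

```python
# Illustrative sketch with invented codings: turning dichotomous
# strategy codes (1 = used per trial) into the usage proportions
# visualized in Figure 5. Strategy labels are hypothetical shorthand.
from collections import Counter

# Each entry: the strategies one participant was coded as using on a trial
l1_codings = [
    ["synchronized_tracking", "content_integration"],
    ["contextual_overview", "synchronized_tracking"],
    ["synchronized_tracking"],
]

counts = Counter(s for trial in l1_codings for s in trial)
total = sum(counts.values())
proportions = {s: n / total for s, n in counts.items()}
print(proportions)  # each strategy as a share of all strategies used
```

Expressing each strategy as a share of the group's total strategy use, as here, is what allows the cross-group comparison of relative reliance in Figure 5.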
As previously defined, contextual overview reading involves an initial, broad scanning of text and images to form a preliminary understanding before focusing on details. The results indicated that L1 participants employed this strategy 12.6% of the time, whereas L2 participants used it 16.2% of the time. A chi-square test (χ²(1)=.020, b=.367, p=.888) revealed no statistically significant difference between the two groups, indicating similar reliance on this initial scanning approach.
Synchronized tracking, associated with aligning eye movements to auditory input in RWL conditions, was the most frequently utilized strategy for both groups. L1 participants employed it 52.6% of the time, and L2 participants 42.5% of the time. Although L1 participants showed a somewhat higher usage rate, no statistical test was conducted for this particular comparison, so the observed frequency difference should be interpreted descriptively.
Keyword-triggered switching is characterized by focusing on images or glosses upon encountering challenging lexical items. The analysis showed that L2 participants employed this strategy more frequently (12.5%) than L1 participants (5.3%). A Wald test confirmed a significant difference between groups (Wald χ²(1)=10.630, b=27.160, p=.001), indicating that L2 learners were more likely to use visual references to manage lexical difficulties. As illustrated in Figure 5, this clear divergence highlights how L2 learners, compared to L1 learners, more actively adjust their gaze to external supports when faced with challenging words.
Content integration reading involves revisiting images after reading segments of text to ensure comprehensive information consolidation. L1 participants employed this strategy 29.5% of the time, and L2 participants 28.7% of the time. Although the observed difference was relatively minor, a Wald test indicated a significant difference (Wald χ²(1)=3.901, b=-18.436, p=.048), suggesting a subtle yet statistically meaningful distinction in strategy use between the two groups. Figure 5 also reflects these proportions, showing that while both groups engage in content integration, L1 participants do so slightly more frequently.
In summary, while both L1 and L2 participants utilized these four reading strategies, Figure 5 effectively highlights the distinct usage patterns. Keyword-triggered switching and content integration reading showed notable group differences, whereas contextual overview reading and synchronized tracking did not yield statistically significant group-level distinctions. These data align with the previously established strategies and analytical frameworks presented in earlier chapters, providing a clearer understanding of how L1 and L2 learners employ multimodal reading strategies.
Effects of WM on Reading Strategy Use in L1 and L2 Learner Groups
This section presents the results of statistical analyses examining how WM factors influence the usage of the previously identified reading strategies by L1 and L2 learners. The analyses are based on the dichotomous coding of strategy use introduced earlier, aligning the current findings with the methodological framework established in preceding chapters. Figure 6 provides a visual representation of WM effects on reading patterns across L1 and L2 groups.
For contextual overview reading, neither phonological WM (Wald χ²(1)=.005, b=.007, p=.942) nor visual WM (Wald χ²(1)=.596, b=-.002, p=.440) had a significant main effect on the use of this strategy. However, a significant interaction was found between phonological WM and group (Wald χ²(1)=4.044, b=.015, p=.044), indicating that increased phonological WM was associated with a decrease in the use of contextual overview reading in the L1 group, while no such trend was observed in the L2 group. Visual WM exhibited no significant interaction with group (b=.002, p=.208), suggesting a similar influence across both groups.
For synchronized tracking, the L2 group consistently used this strategy across all texts, resulting in a lack of distinct data points suitable for statistical differentiation. Therefore, further analysis could not be conducted to determine group differences in this strategy’s usage.
Regarding keyword-triggered switching, both phonological and visual WM had significant effects on strategy use. Phonological WM exhibited an inverse relationship with the frequency of this strategy—higher levels of phonological WM were associated with a reduced use of keyword-triggered switching (Wald χ²(1)=13.354, b=-.037, p<.001). A similar pattern was found for visual WM, where increased levels also corresponded to decreased use of the strategy (Wald χ²(1)=10.464, b=-.009, p=.001). Interaction analyses revealed a significant interaction between group and phonological WM (Wald χ²(1)=10.506, p=.001), indicating differing effects between groups. In the L1 group, higher phonological WM was linked to a decreased usage frequency of keyword-triggered switching (b=-.058, p=.001), while this trend was absent in the L2 group. Additionally, a significant interaction was found between visual WM and group (Wald χ²(1)=6.950, p=.008). In the L1 group, increased visual WM was associated with reduced use of the strategy (b=-.009, p=.008), whereas negligible variation was found in the L2 group, suggesting a group-specific effect of visual WM on keyword-triggered switching.
For content integration reading, neither phonological WM (Wald χ²(1)=2.853, b=.016, p=.091) nor visual WM (Wald χ²(1)=1.716, b=.004, p=.190) had a significant main effect on strategy use. However, a significant interaction was identified between phonological WM and group (Wald χ²(1)=4.318, p=.038), indicating that increased phonological WM in the L1 group (b=.004, p=.038) was linked to more frequent use of content integration reading. This trend was not observed in the L2 group. No significant interaction was detected between visual WM and group (Wald χ²(1)=1.320, p=.251), suggesting that the influence of visual WM on this strategy was consistent across both groups.
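The Wald χ² statistics and b coefficients reported in this section are the kind of output produced by binary logistic models with WM-by-group interaction terms. The following sketch fits such a model on simulated data — variable names, effect sizes, and sample size are all hypothetical, chosen only to show the model structure, not to reproduce the study's results:

```python
# Illustrative sketch on SIMULATED data: a binary logistic model of
# strategy use (1 = used) with a phonological-WM x group interaction,
# the type of model underlying the Wald tests reported above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, n)        # 0 = L1, 1 = L2 (hypothetical coding)
wm = rng.normal(50, 10, n)           # hypothetical phonological WM scores

# Simulate a group-specific effect: higher WM lowers strategy use for L1 only
log_odds = 1.0 - 0.08 * wm * (1 - group)
used = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

df = pd.DataFrame({"used": used, "group": group, "wm": wm})
model = smf.logit("used ~ wm * group", data=df).fit(disp=0)
print(model.summary().tables[1])  # coefficients with Wald z-tests
```

In this formulation, a significant `wm:group` interaction term corresponds to the group-specific WM effects described above (e.g., WM predicting strategy use in the L1 group but not the L2 group).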
DISCUSSION
This chapter interprets the findings regarding the multimodal reading strategies and eye movement patterns of L1 and L2 learners, drawing on the theoretical frameworks and empirical evidence discussed in earlier chapters. By examining how both groups allocate attention to textual, visual, and auditory inputs, and how WM capacity interacts with their strategy use, the present results offer insights into the cognitive processes underpinning multimodal comprehension.
RWL Multimodal Text Reading: Comparing Eye Movements of L1 and L2 Groups
The results indicated that both L1 and L2 learners prioritized text over visuals, devoting a substantial portion of their reading time to textual regions even under RWL conditions. Although earlier theoretical discussions (e.g., Mayer, 2009; Paivio, 1986) suggested that integrating multiple modalities can enhance comprehension, the current findings align with research (e.g., Serrano & Pellicer-Sánchez, 2022) showing that readers predominantly focus on text despite available auditory input. Similarly, Lim and Kim (2024) reported that while learners do attend to visuals in multimodal contexts, their primary reference point often remains textual content.
In the present study, the gap in attention allocation between text and visuals was more pronounced among L2 learners, suggesting that the addition of auditory support did not fully alleviate their cognitive demands. This observation resonates with Holmqvist et al. (2011) and Conklin and Pellicer-Sánchez (2016), whose work indicated that L2 readers tend to make more frequent and prolonged fixations, reflecting heightened cognitive effort. The observed fixation counts and durations support the notion that L2 learners invest additional processing resources, likely due to the challenges of simultaneously decoding linguistic material and interpreting corresponding images. Although previous literature (Chang & Choi, 2014; Pellicer-Sánchez et al., 2020) highlighted potential benefits of multimodal input for L2 learners, the current results suggest these benefits may be tempered by the complexity of integrating multiple information sources within limited cognitive capacity.
Moreover, while auditory cues have been proposed to facilitate more efficient integration of text and visuals (Conklin et al., 2020; Pellicer-Sánchez et al., 2021), the findings here indicate that L2 learners continue to struggle in balancing attention across modalities. The increased variability in L2 learners’ fixation patterns observed in this study aligns with the variability reported by Lim and Kim (2024), implying that L2 learners may not uniformly gain from added auditory support. Instead, their attentional distribution may remain sensitive to additional cognitive burdens, including increased lexical complexity or visual detail.
In contrast, L1 learners demonstrated relatively stable and efficient processing, consistent with the view that native speakers can integrate textual and non-textual elements without incurring substantial cognitive overload (Zhao et al., 2014). Their reduced fixation durations and fewer fixations suggest more automatic processing of multimodal input, aligning with theoretical predictions of lower extraneous cognitive load when linguistic decoding is less demanding (Mayer, 2009; Sweller, 1994).
Taken together, these outcomes confirm persistent challenges for L2 learners in RWL multimodal contexts and reinforce the notion that modality integration is more cognitively taxing for those with less automated language processing skills. These findings highlight the need to examine WM and strategy use more closely.
Differences in Reading Strategies Between L1 and L2 Groups
Building on the analytical framework of four key strategies— contextual overview reading, synchronized tracking, keyword-triggered switching, and content integration reading—this section interprets the observed group differences. Contextual overview reading, involving an initial scanning of textual and visual elements, showed no significant group-level differences (χ²(1)=.020, p=.888). Both L1 and L2 learners employed this top-down approach at similar rates. This result aligns with earlier findings (Serrano & Pellicer-Sánchez, 2022; Zhao et al., 2014), indicating that learners commonly establish a global framework regardless of proficiency. Holsanova (2020) emphasizes that initial global scanning in multimodal reading helps connect linguistic and visual components, allowing schema activation before detailed processing. The absence of proficiency-related differences suggests that contextual overviewing may function as a foundational strategy used similarly by both groups.
Synchronized tracking, in which eye movements align with auditory input, was the most frequently employed strategy for L1 (52.6%) and L2 (42.5%) learners. Although no statistical test was reported for this difference, its prominence suggests that auditory cues effectively guide visual attention in real time (Conklin et al., 2020; Pellicer-Sánchez et al., 2021). According to Mayer’s Cognitive Theory of Multimedia Learning (2003), well-coordinated audiovisual input supports efficient resource allocation, potentially explaining the high utilization rates. Brosseuk and Downes (2024) further highlight that multimodal alignment in educational settings enables learners to synchronize diverse input channels. While previous research (e.g., Pellicer-Sánchez et al., 2020) indicates that such alignment does not necessarily yield uniform comprehension gains, the widespread use of synchronized tracking underscores its role in coordinating multimodal cues.
Keyword-triggered switching, wherein learners shift attention to images or glosses upon encountering challenging lexical items, occurred more often among L2 learners (12.5%) than L1 learners (5.3%; Wald χ²(1)=10.630, p=.001). This pattern supports prior evidence (Serrano & Pellicer-Sánchez, 2022; Warren et al., 2018) that L2 readers depend on localized visual cues to manage lexical difficulties. Van der Sluis et al. (2017) similarly reported that strategically placed visuals aid learners in navigating textual complexity. Thus, keyword-triggered switching reflects a bottom-up approach, where L2 learners leverage visual supports to reduce cognitive load associated with challenging vocabulary.
Content integration reading, which entails revisiting images after reading text segments, revealed a marginal but meaningful difference (L1: 29.5%, L2: 28.7%; Wald χ²(1)=3.901, p=.048). Although both groups employed this strategy, L1 learners did so slightly more frequently. This pattern resonates with Kintsch’s Construction-Integration Model (1998), which suggests that more proficient readers integrate multiple information sources more fluidly. Zhao et al. (2014) similarly noted that skilled readers navigate multimodal content with relative ease, and Holsanova (2020) observes that iterative engagement with multimodal components refines mental representations. The subtle yet statistically meaningful difference in usage may reflect L1 learners’ relatively more automatic processing.
In summary, while both L1 and L2 learners utilized all four strategies, certain patterns stood out. Advanced L2 learners relied more on keyword-triggered switching to handle lexical challenges, whereas L1 learners engaged slightly more in content integration reading. At the same time, both groups shared foundational approaches like contextual overview reading and demonstrated a high usage rate of synchronized tracking. Referencing established theories (Mayer, 2003; Paivio, 1986) helps clarify how native speakers and advanced L2 learners negotiate complex multimodal texts under RWL conditions.
Influence of WM on Reading Strategies
Building on prior work (Lim & Kim, 2024) that examined L1 and L2 learners in text-only and text-and-image contexts, this section explores how WM factors shape four identified reading strategies— keyword-triggered switching, content integration reading, synchronized tracking, and contextual overview reading—under RWL conditions.
Keyword-triggered switching revealed distinct patterns across L1 and L2 groups. Among L1 learners, increased phonological WM capacity corresponded to a reduced reliance on this strategy (χ²(1)=13.354, p<.001), aligning with the phonological loop’s role in verbal processing (Baddeley & Hitch, 1974). This result also parallels findings that L1 learners effectively process text without heavily depending on visual aids when WM is robust (Lim & Kim, 2024). By contrast, L2 learners continued to use keyword-triggered switching irrespective of WM capacity, reflecting persistent lexical or syntactic demands (Linck et al., 2014; Pellicer-Sánchez et al., 2021).
Content integration reading showed significant WM-related effects for L1 learners, with increased phonological WM linked to more frequent use of this strategy (Wald χ²(1)=4.318, p=.038). This outcome supports the Construction-Integration Model (Kintsch, 1998), suggesting that skilled readers iteratively merge textual and visual inputs. In contrast, no significant WM effect emerged for L2 learners, indicating that cognitive limitations persist even when auditory input is introduced. L1 learners, however, appear to capitalize on WM resources to navigate multimodal content more efficiently (Lim & Kim, 2024).
Synchronized tracking, which aligns visual attention with concurrent auditory input, was employed by L2 learners at a consistent rate, regardless of WM capacity. This stability implies that L2 readers rely on such tracking primarily to maintain engagement rather than reduce language-processing strain (Kim et al., 2023). Mayer’s Cognitive Theory of Multimedia Learning (2003) also posits that multimodal input distributes cognitive load, an idea further supported by Brosseuk and Downes (2024). Similar benefits for L2 learners have been noted in other RWL contexts (Lim & Kim, 2024).
The current RWL-focused analysis builds on previous work (Lim & Kim, 2024) by introducing an auditory dimension to text-only and text-and-image conditions. The results underscore how the additional modality influences strategy use, especially for L2 learners who exhibit heightened variability in attention and a stronger reliance on keyword-triggered switching. However, the absence of a non-audio, independent reading condition (Kim et al., 2023) limits attributing the observed patterns solely to RWL demands.
Limitations and Future Research
Despite its contributions, this study has several limitations that suggest directions for future research. A relatively small sample size restricts the generalizability of the findings, highlighting the need for larger and more diverse participant groups. Moreover, individual learner variables—such as language proficiency, native language background, and cultural factors—were not thoroughly examined, even though these attributes likely influence attention allocation and strategy use. Investigating these variables in subsequent research may clarify whether the observed patterns are universal or contingent upon specific learner characteristics.
This study employed a single type of auditory input—narrated texts by a native speaker. Exploring variations in auditory modalities, including synthetic speech, accented speech, or differing speech rates, could uncover how these factors influence multimodal integration, particularly for L2 learners. Furthermore, the experimental design focused exclusively on the RWL condition without comparisons to alternative presentation formats, such as purely textual content or text-and-image conditions without narration. Systematic comparisons across diverse formats, as noted by Kim et al. (2023), would help isolate the unique contributions of auditory support and its interaction with visual inputs.
The absence of a comprehension assessment also limits the ability to directly link the observed reading strategies to measurable understanding. Without direct comprehension measures, it remains unclear whether identified strategies enhance overall reading outcomes. Incorporating comprehension assessments, as recommended by Pellicer-Sánchez et al. (2021), would provide more robust evidence concerning the effectiveness of these multimodal strategies.
Addressing these limitations in future research can advance understanding of multimodal reading processes and the cognitive and strategic adaptations of L1 and L2 learners. Such work may also refine theoretical models, including the Working Memory Model (Baddeley & Hitch, 1974) and the Construction-Integration Model (Kintsch, 1998), by revealing how different modalities support reading comprehension.
Conclusion
This analysis examined L1 and advanced L2 learners’ processing of multimodal texts under RWL conditions, focusing on attention allocation, reading strategies, and WM influences. While prior work has explored text-plus-image or reading-only contexts, the present findings illustrate the added complexity that audio introduces for proficient L2 readers. Unlike studies showing consistent audio-related benefits among younger L1 learners (e.g., Kim et al., 2023), the data here reveal that advanced L2 learners may still encounter challenges in seamlessly integrating text, images, and audio.
By identifying a continued reliance on compensatory strategies and revealing the nuanced role of WM in strategic reading, the analysis provides deeper insight into second language multimodal literacy. In practice, these results call for instructional materials and methods that address not only linguistic proficiency but also cognitive demands, thereby guiding advanced L2 learners toward more integrative reading strategies. Ultimately, understanding the intersection of WM, proficiency, and modality integration is crucial for refining theoretical perspectives on L2 comprehension and for developing more effective, evidence-based approaches to teaching multimodal reading skills.
Notes
This reliability threshold for valid data was set following Holmqvist et al. (2011), who recommend achieving over 80% valid data to ensure accuracy in eye-tracking studies.
The argumentative essay was excerpted from a recommended writing textbook for middle school students in grades 1–2 (level 7), focusing on topics such as “The Necessity of Leisure Time for Adolescents” and “Reducing Smartphone Usage.” This essay clearly presents its claims and supporting evidence, incorporating typical discourse markers like “first” and “second,” which are characteristic features of argumentative texts. The explanatory text was drawn from the third-year middle school science textbook, Atmosphere and Weather, and focuses on topics like “How Clouds Form” and “How Clouds Become Raindrops,” divided into three paragraphs to provide detailed scientific explanations. The narrative text was adapted from reading materials in Learning Korean through Traditional Fairy Tales, specifically “The Lazybones Who Became a Cow” and “The Three-Year Hill.” These texts feature protagonists whose actions unfold over time, with modifications to vocabulary and content to suit the learners’ proficiency level.
Although the phonological WM task was conducted in Korean, similar approaches have been validated in studies with Korean language learners (Baik, Kim, & Lee, 2016; Lim & Kim, 2024). In addition, O’Brien, Segalowitz, Freed, and Collentine (2007) suggest that L2-based phonological WM measures may provide meaningful insights into learners’ linguistic capabilities, indicating that employing Korean-based tasks with TOPIK 6-level learners need not be inherently disadvantageous. The selected words were common, short, and clearly pronounced, thereby minimizing extraneous cognitive load. Furthermore, these words were chosen from high-frequency nouns identified by Kang and Kim (2009) at the beginner level of the test of proficiency in Korean (TOPIK). The stimuli varied from one to three syllables in length, exemplified by 산 (san, ‘mountain’) as a one-syllable word, 학교 (hakgyo, ‘school’) as a two-syllable word, and 선생님 (seonsaengnim, ‘teacher’) as a three-syllable word.
The arithmetic problems were designed to be solvable through mental calculations and were evenly divided between true and false equations.
Although any intelligible pronunciation was accepted, responses in which the target word was not accurately recalled, resulting in a mispronunciation, were assigned .5 points. For example, if 산 (san, ‘mountain’) was ambiguously pronounced as 상 (sang, ‘prize’), both native speakers and learners received .5 points. This scoring procedure is consistent with that employed by Lim and Kim (2024).
While the 150 Hz sampling frequency of the Gazepoint GP3 may not offer the highest level of precision available, previous studies, such as Pellicer-Sánchez et al. (2021), have effectively used eye-trackers with frequencies of 120 Hz for research on reading and visual attention, demonstrating sufficient reliability for capturing key eye movement metrics in educational experiments.