Why readers leave the page - and what a comprehension tool should do instead

Colm O'Connor

AI Augmented Reading · colm@aiaugmentedreading.com

Published June 2026

Abstract

Readers of dense, technical text often leave the page in order to understand it, searching for a diagram or an illustrated explanation before returning to where they left off. This article argues that this common behaviour points to a distinct class of reading-support tool - a “comprehension tool” - that is worth distinguishing from the productivity tools, such as summarisers and writing assistants, that dominate most discussion of artificial intelligence and reading. Four defining properties are proposed, the idea is set against work that already exists, and a distinction is drawn that I take to be central: between delivering more words to a reader who has not understood the prose and delivering a generated visual instead. The underlying interaction, in which help is delivered inline in response to a reader's selection, is not new, and several systems already share it; what those systems deliver, though, is words - highlights, annotations, written answers - and not a generated picture. I am not aware of any system that generates an image inline, in response to a reader's selection, as an aid to comprehension; the open question is whether doing so - supplying the picture rather than more text - measurably improves understanding.

Keywords: reading comprehension · comprehension tools · information literacy · generative AI · augmented reading


Introduction

A reader working through an unfamiliar technical passage will often do something the text never asked of them: they stop, open a second window, and look for a visual. A student reading a description of a chemical reaction searches for the mechanism as a diagram. Someone working through a statistics chapter looks for a worked, visual example of the distribution being described. A reader faced with a particularly dense paragraph seeks out an illustrated explanation, or a short video, and then comes back to the text. The page is not abandoned; it is supplemented, on the reader's own initiative, with a picture the reader believes will make the next sentence make sense.

In the course of supporting learners working with dense academic material I have seen this happen often, and I think it is worth taking seriously rather than treating as a quirk of how people study. That readers reach for a picture is not, in itself, surprising. There is a long line of research in educational psychology showing that well-chosen visuals alongside text improve comprehension and retention compared with text on its own (Mayer and Gallini, 1990; Mayer, 2009; Carney and Levin, 2002), an effect confirmed across many studies by meta-analysis (Guo et al., 2020). What the behaviour tells us is more specific. For a significant class of text, the reading interface as delivered is not sufficient for understanding, and readers know it well enough to work around it - at a cost. Leaving the page breaks the reader's place, adds a search task, and usually returns something generic rather than something fitted to the exact sentence that caused the difficulty.

This article takes that everyday workaround as its starting point. I want to suggest that it points to a kind of reading-support tool that has not been clearly named, even though pieces of it have already been built, and that naming it helps us see both what such a tool should do and how we might tell whether it works.

A category worth naming

Most of the current conversation about artificial intelligence and reading is really about productivity. A summariser compresses a document so the reader can avoid reading all of it. A writing assistant drafts, rewrites and edits. A “copilot” answers questions about a text and pulls out the key points. In each case the tool is doing well when it lets the user produce something - a summary, a draft, an answer - more quickly, and the unspoken ideal is often that the reader need not engage with the underlying text at all.

A tool of the kind I am describing does the opposite. Its job is to help a reader understand a particular passage that they have chosen to read and intend to keep reading. It does not summarise the text away or read it on the reader's behalf. It steps in at the moment of difficulty, supplies something that makes the hard passage clearer, and then hands the reader back to the text. What is being optimised is the reader's understanding of the material, not the speed of producing something else from it. It seems useful to give this kind of tool its own name, and I will call it a comprehension tool.

The distinction is more than a label, because the two kinds of tool pull in opposite directions. A productivity tool is rewarded for letting the reader skip the text; a comprehension tool is rewarded for keeping the reader in it. A productivity tool usually works on a whole document in order to transform it; a comprehension tool works on a small, reader-chosen span in order to illuminate it, and then gets out of the way. The output of a productivity tool is the deliverable the user keeps; the output of a comprehension tool is scaffolding, useful only until understanding arrives and then discarded. None of this is a criticism of productivity tools, which are genuinely useful. The point is only that comprehension is a different goal, and that it has had far less explicit attention.

What defines a comprehension tool

I would suggest four properties. Each one does real work: take any single one away and the tool turns back into something we already have.

Reader-initiated: The reader asks for help, at a point of their own choosing, rather than the system deciding for them where help is needed. The reader is the one who knows which sentence has tripped them up. Take this away and you have an automatic enrichment layer that competes with the text for attention.
At read time: The help arrives during reading, at the moment of difficulty, not as a separate preparatory step beforehand or a study exercise afterwards. Take this away and you have a document-preparation or study-guide tool, which is a different thing.
Inline: The help appears within the reading surface, next to the passage that prompted it, rather than in a separate window or chat thread. This is the property most directly suggested by the workaround, because the whole cost of leaving the page is that it is not inline. Take this away and you have a reading-adjacent chatbot.
Position-preserving: The reader does not lose their place, scroll away, or navigate elsewhere; the help is delivered and then dismissed without displacing where they were. Take this away and you have reintroduced exactly the navigational cost the tool was meant to remove.

Put together, these four describe a tool that meets a need the reader actually feels, at the moment they feel it, where they feel it, without making them pay the costs that the everyday workaround imposes.
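
The four properties can also be read as a checklist. The sketch below is only an illustration of that reading, not part of any existing system: the names `ToolProfile`, `FALLBACKS`, `missing_properties` and `classify` are hypothetical, and the fallback labels simply restate what the list above says a tool becomes when one property is removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Which of the four proposed properties a reading-support tool has."""
    reader_initiated: bool
    at_read_time: bool
    inline: bool
    position_preserving: bool


# What the article says a tool turns back into when a property is removed.
# The keys are hypothetical field names; the labels paraphrase the text.
FALLBACKS = {
    "reader_initiated": "automatic enrichment layer",
    "at_read_time": "document-preparation or study-guide tool",
    "inline": "reading-adjacent chatbot",
    "position_preserving": "tool that reintroduces the navigational cost",
}


def missing_properties(tool: ToolProfile) -> list[str]:
    """Return the names of the properties the tool lacks, in list order."""
    return [name for name in FALLBACKS if not getattr(tool, name)]


def classify(tool: ToolProfile) -> str:
    """Label a tool as a comprehension tool only if all four properties hold."""
    missing = missing_properties(tool)
    if not missing:
        return "comprehension tool"
    return "; ".join(FALLBACKS[name] for name in missing)


# A tool that answers in a separate chat window fails only the inline test.
separate_chat = ToolProfile(
    reader_initiated=True,
    at_read_time=True,
    inline=False,
    position_preserving=True,
)
print(classify(separate_chat))  # prints "reading-adjacent chatbot"
```

The checklist is deliberately strict: a tool counts as a comprehension tool only when all four properties hold, which mirrors the claim that each property does real work.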

What already exists

It is important to be clear that the basic interaction here is not new, and I do not want to suggest otherwise. Several research systems have already shown that text can be augmented on demand, within the reading surface, in response to what a reader selects. Scim, a system providing intelligent skimming support for scientific papers, surfaces salient content inline, with the reader able to control where it appears (Fok et al., 2023). A browser tool generates annotations on a manuscript directly where the reader is working (Díaz et al., 2024). And a recent interaction-design thesis prototyped almost exactly the interaction I have described: on highlighting a passage and choosing a prompt, an AI response is inserted beneath the selection, the surrounding text reflowing to make room, with a control to collapse it again and restore the original (Melin-Higgins, 2024).

What each of these delivers, though, is words. Scim highlights text that is already on the page; the browser tool attaches written annotations; the thesis prototype inserts a written response. This is the difference I want to draw out, because I think it matters more than it first appears. A reader who has not understood a passage of prose is, in each case, handed more prose - an explanation, an annotation, an answer - in the same verbal mode that defeated them to begin with. The behaviour that prompted this article is different in kind: the reader goes looking for a visual. The evidence that visuals aid understanding (Mayer and Gallini, 1990; Carney and Levin, 2002; Guo et al., 2020) is evidence about a change of mode, from words to image, and not about supplying more words.

Two very recent studies bear directly on the idea. A controlled investigation of AI help anchored to text found that readers preferred help integrated with the document over a separate chat window, and preferred selecting the text themselves over having it chosen for them (Joshi and Vogel, 2026), which is encouraging for two of the four properties: inline and reader-initiated. The same study found no measurable effect on comprehension. Its help, however, was also textual, and its passages short and undemanding, so it leaves the visual question untouched. Separately, an augmented reading interface that links readers to figures and tables already present in a document improved reading-quiz scores without adding time or cognitive load (Hwang et al., 2026) - though it surfaces visuals the document already contains rather than generating one where the prose has none.

The gap this article identifies

I am not aware of any existing system that generates an image, inline, in response to a reader's selection, as an aid to understanding the passage that prompted it. The components are not new - images can be generated, content can be placed inline, a selection can trigger it - and I make no claim to having originated any of them. What I have not found is a system that brings them together.

What we still need to find out

Naming a category is only useful if it can be tested, and the central question is a plain one: does a comprehension tool of this kind actually improve comprehension of dense material? This is not the same question as whether it is fast, or well-liked, or frequently used; a tool can be all three and still make no difference to understanding, or even substitute a plausible-looking picture for genuine understanding.

The way to find out is not elaborate. It is to compare how well readers understand a genuinely difficult passage under three conditions: reading it unaided, reading it while free to leave the page and search the web for a supporting image, and reading it with an inline comprehension tool. Comprehension itself - measured with questions that test understanding and transfer, not mere recall - is the outcome that matters, and the passage has to be hard enough that there is room to improve. The comparison with the everyday workaround is the important one, because that workaround already delivers some benefit; the tool has to do better than the reader leaving the page, not merely better than nothing.
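
As a minimal sketch of how such a comparison might be set up, the code below assigns participants at random to the three conditions in near-equal groups. It assumes a between-subjects design, which the text above does not specify; the condition labels, the function `assign_conditions` and the participant IDs are hypothetical, and no outcome data is modelled.

```python
import random
from collections import Counter

# The three reading conditions described above (labels are hypothetical).
CONDITIONS = ("unaided", "leave_page_and_search", "inline_comprehension_tool")


def assign_conditions(participant_ids, seed=0):
    """Randomly assign participants to the three conditions.

    Assumes a between-subjects design: each participant reads the passage
    under exactly one condition.
    """
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin so that group sizes
    # differ by at most one.
    return {
        pid: CONDITIONS[i % len(CONDITIONS)]
        for i, pid in enumerate(shuffled)
    }


participants = [f"P{n:02d}" for n in range(1, 13)]  # hypothetical IDs
assignment = assign_conditions(participants, seed=42)
print(Counter(assignment.values()))  # four participants in each condition
```

Shuffling first and then dealing round-robin keeps the groups balanced while leaving the choice of who lands in which group to chance. A within-subjects design, in which each reader meets all three conditions on different passages, would need a counterbalancing scheme instead.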

It is worth saying plainly that the answer might be negative. The study noted above (Joshi and Vogel, 2026) found no comprehension effect from a related, if weaker, intervention. A clear negative result would itself be useful, because it would tell us where the effort is, and is not, worth spending. The value lies in having an answer, not a particular answer.

Why this matters for those who support readers

For those of us whose work involves helping people read and understand - librarians, those who teach information literacy, anyone who supports learners through difficult material - this is not an abstract question. We already see readers struggling with dense text and improvising their own visual aids, and we already know from the wider literature that good visuals help. A tool that meets that need at the point of difficulty, without pulling the reader off the page, would sit naturally alongside the support we try to provide. Equally, if such tools turn out not to help, that is something we ought to know before recommending them.

I offer the idea of a comprehension tool, then, not as a finished theory and not as a claim to have invented the underlying interaction - the selection of text triggering an inline response - which several systems have already demonstrated. What those systems deliver is words: highlights, annotations, written answers. I offer the framework instead as a way of organising a space that is currently described mostly in the language of productivity, and of naming what, as far as I am aware, has not yet been attempted: putting a generated image, rather than more prose, in front of the reader at the moment of difficulty. The aim is to name it precisely enough that we can design for it deliberately, and test whether it works.

References

Carney, R.N. and Levin, J.R. (2002) ‘Pictorial illustrations still improve students’ learning from text’, Educational Psychology Review, 14(1), pp. 5-26.
Díaz, O., Garmendia, X. and Pereira, J. (2024) Streamlining the review process: AI-generated annotations in research manuscripts. arXiv:2412.00281.
Fok, R., Kambhamettu, H., Soldaini, L., Bragg, J., Lo, K., Hearst, M.A., Head, A. and Weld, D.S. (2023) ‘Scim: intelligent skimming support for scientific papers’, in Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI '23). ACM, pp. 476-490.
Guo, D., Zhang, S., Wright, K.L. and McTigue, E.M. (2020) ‘Do you get the picture? A meta-analysis of the effect of graphics on reading comprehension’, AERA Open, 6(1).
Hwang, A., Kambhamettu, H., Yang, Y., Patel, A., Chang, J.C. and Head, A. (2026) Connecting the dots: surfacing structure in documents through AI-generated cross-modal links. arXiv:2602.16895.
Joshi, N. and Vogel, D. (2026) ‘Designing and evaluating AI margin notes in document reader software’, in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26). ACM.
Mayer, R.E. (2009) Multimedia learning. 2nd edn. Cambridge: Cambridge University Press.
Mayer, R.E. and Gallini, J.K. (1990) ‘When is an illustration worth ten thousand words?’, Journal of Educational Psychology, 82(4), pp. 715-726.
Melin-Higgins, L. (2024) AI in document interaction: an interaction design approach to enhancing reading practices. Bachelor thesis. Malmö University.