Why large language models won't save spaced repetition (but how they might help)
Sketches are thoughts I’m still chewing on: early explorations that might grow into something more, or might simply get tossed.
The friction of creating good flashcards has long been the primary impediment to the widespread adoption of spaced repetition. So when large language models arrived, the seemingly obvious application was to automate this bottleneck away. The dream of feeding a textbook into a model and receiving a perfectly formed deck of Anki cards is a powerful one. It is also a trap.
The first-order failure of this approach is obvious to anyone who has tried it. The naive “upload-a-doc” tools churn out cards that are voluminous but useless, a problem I call the automation-relevance tradeoff. An LLM, lacking any model of your existing knowledge, interests, or specific learning goals, cannot possibly know which facts are trivial and which are revelatory. The inevitable result is a deluge of irrelevant cards.
A more sophisticated approach might attempt to solve this by introducing a human-in-the-loop filtering step, the idea being that you sacrifice some automation so that the resulting cards are more relevant to you. I explored this path when I built Memoria, a prototype that first used an LLM to decompose a text into discrete topics, then presented them to me for selection before generating the final cards. This improved relevance, but in doing so, it revealed a far more insidious problem: making flashcard creation too cheap fundamentally corrupts the practice.
When the cost of creating a card approaches zero, you create too many of them. The momentary curiosity you feel about a tangential fact can be indulged with a single click, but the cognitive and, more importantly, the motivational cost is paid later, during review. Your spaced repetition deck becomes progressively engorged with what Michael Nielsen calls “orphan cards”: questions that are poorly integrated with your overall interests and knowledge, and whose emotional valence at the time of creation has long since evaporated, turning daily reviews into a tedious slog through accumulated trivia.
Worse still, frictionless creation fosters metacognitive laziness. It allows you to use a large volume of cards as a crutch for an undeveloped conceptual schema of the material. The illusion is that as long as the information is captured in a flashcard, understanding will magically materialize through the brute force of spaced repetition.
Piotr Wozniak has been drumming this point home for close to three decades (emphasis in the original):
Before you proceed with memorizing individual facts and rules, you need to build an overall picture of the learned knowledge. Only when individual pieces fit to build a single coherent structure, will you be able to dramatically reduce the learning time.
This does not mean LLMs have no role to play. After spending the last couple of years trying to apply large language models to flashcard creation, my view is that the proper place for these models is not in the creation of flashcards, but in their refinement.
The generative act of wrestling with a concept until it can be distilled into a precise, atomic question is a strong enabler of durable learning. This struggle, and it is a struggle, is not a bug to be optimized away; it is the entire point. The act of creating atomic flashcards forces the active recall and synthesis that a passively consumed, AI-generated card circumvents. It is how your confusion is laid bare, and it is what forces you to build the coherent mental structures onto which further knowledge can later be bolted.
Once the initial step of manually crafting a quality flashcard is complete, large language models can be productively invoked: not to generate the card in the first place, but to enhance and refine it after the fact.
An awkwardly phrased question can be instantly sharpened. A verbose answer can be trimmed without losing precision. Most powerfully, a single well-formed card can seed an entire family of related cards: the original question-and-answer pair can be quickly and cheaply turned into a cloze deletion, the question and answer can swap places to test bidirectional recall, or the model can suggest a number of permutations that probe the same concept from different angles.
These last examples might seem to contradict the points made earlier. Wasn’t the whole point that large language models should not be writing flashcards? Yet there is no contradiction: we already did the hard work of formulating the original card, and the permutations amount to nothing but mechanical variations on it. Once you’ve written a good card, letting the model generate cloze deletions, reverse Q&A versions, or rewordings that emphasize different aspects isn’t cutting corners; it’s offloading busywork. You’re not asking the AI to decide what matters; you’ve already done that. You’re just asking it to help you practice that knowledge from more angles without spending time creating those cards yourself.
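To make the distinction concrete, here is the kind of mechanical expansion I have in mind. The card and its variants below are illustrative, written by hand rather than produced by any particular model:

```
Hand-written card:
  Q: What does Michael Nielsen call cards that are poorly integrated
     with your overall interests and knowledge?
  A: Orphan cards.

Mechanical variants an LLM can derive:
  Cloze:    Michael Nielsen calls cards that are poorly integrated with
            your overall interests and knowledge {{c1::orphan cards}}.
  Reversed: Q: What are orphan cards?
            A: Cards that are poorly integrated with your overall
               interests and knowledge.
```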
One way I have started to put these insights into practice is in Obsidian, using the Copilot and Obsidian to Anki plugins. I write flashcards directly in Obsidian, at the same time that I am writing down my thoughts on what I am reading. With a card selected, I can trigger custom commands that send it to an LLM as context, asking the model, for example, to classify the card’s quality and, if it deems the card mediocre, to offer concrete suggestions for improvement. Just as easily, a single command can have an LLM generate permutations of that card or turn it into a cloze deletion.
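Since these custom commands amount to wrapping a prompt around the selected card, the same workflow is easy to sketch as a standalone script. The snippet below is a minimal sketch under a few assumptions of mine: the prompt wording is illustrative rather than anything shipped with the Copilot plugin, and the OpenAI client and model name simply stand in for whichever backend you prefer.

```python
# Minimal sketch of the refinement step: grade a hand-written card, then ask
# for mechanical variants. The prompts are illustrative; the OpenAI client and
# model name are one possible backend among many.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADE_PROMPT = """You are reviewing a single spaced-repetition flashcard.
Classify it as GOOD or MEDIOCRE. A good card is atomic, precise, and
answerable without the source text in front of you. If MEDIOCRE, propose
one concrete rewrite.

Card:
{card}"""

PERMUTE_PROMPT = """Given this flashcard, produce mechanical variants only:
one cloze deletion (Anki {{{{c1::...}}}} syntax) and one reversed
question/answer pair. Do not introduce any new facts.

Card:
{card}"""


def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def refine(card: str) -> str:
    """Grade a hand-written card and, if needed, suggest an improvement."""
    return ask(GRADE_PROMPT.format(card=card))


def permute(card: str) -> str:
    """Generate mechanical variants (cloze, reversed) of a finished card."""
    return ask(PERMUTE_PROMPT.format(card=card))


if __name__ == "__main__":
    card = "Q: What does Michael Nielsen call poorly integrated cards?\nA: Orphan cards."
    print(refine(card))
    print(permute(card))
```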
I should clarify what I am actually doing when I write flashcards in Obsidian. I am not simply extracting isolated facts or copying down free-floating information from the source text. Creating flashcards is embedded in a broader process of reflection and synthesis: I am reading, writing, connecting ideas - building understanding - and, from that same act of sensemaking, I generate flashcards that distill the higher-order insights I’ve just worked through. These are not shallow knowledge fragments, but compressed forms of thought that encode my own nascent conceptual structure of the material. The goal isn’t just to retain a name, a date, or a definition, but also to ensure that the understanding I’ve constructed doesn’t evaporate with time and can be built upon and applied later on.
The application of spaced repetition to encode durable, conceptual understanding is vastly underexplored. It is difficult to find good examples, let alone guidance, on how to use flashcards this way; Andy Matuschak and Michael Nielsen stand out as rare exceptions. Because language learners and medical students dominate the discourse around spaced repetition, most users inherit a narrow and impoverished conception of what it can be used to accomplish. The examples they encounter anchor their understanding of its potential to the memorization of shallow, non-conceptual knowledge, crowding out more ambitious applications. The method becomes defined by its most superficial use cases.
Ultimately, the naive application of large language models to mass-produce flashcards is a project born of a fundamental misunderstanding of how learning happens. It mistakes the accumulation of isolated facts for the construction of knowledge. One does not build the Parthenon by haphazardly piling stones; a coherent architecture must precede and guide the placement of every block. In the same way, durable cognitive maps are not formed by the brute-force rote memorization of disconnected data points.
This is not a rejection of spaced repetition, nor of applying large language models to support it. I am merely suggesting a crucial clarification of their respective roles.
Spaced repetition remains an unparalleled tool for reinforcing memory, but it requires a pre-existing structure upon which new information can be hung. It can help bootstrap a nascent schema, but it cannot create one on its own. Large language models, in turn, find their proper place not as surrogates for the difficult and demanding intellectual labor of identifying key concepts and distilling them into clear and enlightening flashcards, but as tireless assistants to which you offload rote, mechanical work, freeing you to proceed with the actual work of developing genuine comprehension.