TL;DR: Researchers at EPFL have published a paper accepted to ICLR 2026 that resolves temporal incoherence — the "drift" problem — in generative video models. The breakthrough removes the practical length ceiling that has constrained tools like Sora, Veo, and Kling to clips of a few seconds without coherence collapse. The technique enables minutes-long AI video with consistent characters, stable environments, and coherent motion, fundamentally changing what is commercially possible in AI-generated content.
What you will learn
- What temporal incoherence is and why it matters
- How current AI video tools hit their limit
- The EPFL approach: how they solved drift
- What ICLR 2026 acceptance means for credibility
- Minutes-long AI video: what is now possible
- Commercial implications for Sora, Veo, and Kling
- Hollywood and professional content creation impact
- Remaining technical challenges
- The competitive landscape after this paper
- What to expect in the next 12 months
- Frequently asked questions
What temporal incoherence is and why it matters
If you have ever used Sora, Veo, Kling, or any other AI video generation tool, you have seen temporal incoherence even if you did not have a name for it.
It is the moment, around the three-second mark on a typical clip, when a character's hand changes shape between frames. Or the background color shifts slightly every few frames without any lighting reason. Or a person's face slowly morphs across a ten-second clip until they barely resemble who they were at the start. Or an object disappears entirely and reappears a different size. These artifacts are not bugs in the traditional sense — the model is not malfunctioning. They are the direct consequence of how diffusion-based video models generate frames.
Temporal incoherence is the technical term for a generative video model's inability to maintain consistent representations across frames over time. The longer the video, the more the model's internal representation of what it is generating "drifts" away from its earlier state. Characters lose consistent features. Environments shift subtly, and those shifts accumulate into gross inconsistency. Physics breaks down — objects that cast a shadow in frame one cast no shadow in frame thirty.
The reason this matters is not aesthetic. It is economic and practical. Temporal incoherence is the primary reason commercial AI video tools cap their generations at 4–10 seconds. It is not a computational constraint in the sense of needing more GPU time. It is a coherence constraint. Running a model longer does not produce a longer coherent video — it produces a short coherent video followed by increasingly hallucinatory noise. The only way to produce longer content with existing tools is to stitch multiple short clips together and hope the transitions are not too jarring. For content creators, this is the defining limitation of the entire category.
To understand what EPFL solved, you have to understand what causes drift in the first place.
Modern AI video generators are built on diffusion models adapted for temporal sequences. The fundamental architecture works by predicting a sequence of frames conditioned on a text or image prompt. At generation time, the model maintains an internal representation — a latent state — of what it is generating. For the first few frames, this representation is tightly anchored to the conditioning signal. The model "knows" what the video is about and produces frames consistent with that knowledge.
As generation extends, the latent state must carry information forward frame by frame. In practice, this forward propagation is imperfect. Each new frame is generated with some stochastic noise inherent to diffusion sampling. Over multiple frames, the small errors introduced by that noise compound. The cumulative effect is that the model's internal state at frame thirty is measurably different from its state at frame one, even though the conditioning prompt has not changed.
This compounding error is the drift. It is a mathematical inevitability of the autoregressive or semi-autoregressive frame generation approach that underlies virtually all current commercial video models. OpenAI's Sora, Google's Veo, Kuaishou's Kling, and Runway's Gen-3 all face this same fundamental constraint. Their teams have applied various engineering patches — latent space normalization, targeted temporal attention layers, careful tuning of noise schedules — but none has solved the underlying problem. They have only slowed the rate of drift, which is why they can produce coherent clips of five to eight seconds rather than two to three, but still cannot produce coherent clips of two minutes.
The ceiling is baked into the architecture. Or rather, it was.
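To make the compounding concrete, here is a minimal numerical sketch (not the paper's code; a plain random vector stands in for a diffusion latent) of how small per-frame sampling errors accumulate into drift in a purely forward generation loop:

```python
# Minimal illustration of drift: each "frame" adds a small stochastic error to the
# latent state, standing in for the noise of diffusion sampling, and the distance
# from the original state grows steadily even though nothing in the prompt changed.
import numpy as np

rng = np.random.default_rng(0)
dim, num_frames, noise_scale = 512, 120, 0.02

latent = rng.standard_normal(dim)   # state after the first frame
reference = latent.copy()           # what the model "meant" at frame one

drift = []
for _ in range(num_frames):
    latent = latent + noise_scale * rng.standard_normal(dim)  # per-frame error
    drift.append(np.linalg.norm(latent - reference))

print(f"drift after 10 frames:  {drift[9]:.3f}")
print(f"drift after 120 frames: {drift[-1]:.3f}")  # grows roughly with sqrt(frames)
```

No single step looks wrong, which is why per-frame quality metrics miss the problem; it is the accumulated distance from the starting state that matters.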
The EPFL approach: how they solved drift
The EPFL paper introduces a mechanism that researchers describe as temporal anchoring with periodic coherence reinforcement. The core idea challenges the assumption that video generation must be a purely forward process.
Standard video diffusion generates frames sequentially, with each frame's latent representation conditioned only on previous frames and the original prompt. The problem is that the original prompt becomes diluted as a conditioning signal the further generation progresses. By frame fifty, the model's primary conditioning is not the original text description — it is the accumulated latent state of the previous forty-nine frames, which already contains accumulated drift.
The EPFL approach introduces what the paper calls coherence anchors: reference representations sampled from the early portion of the video that are actively maintained in the model's attention context throughout the entire generation process. Rather than allowing the conditioning signal to be purely temporal (each frame attends to its predecessors), the model maintains a small set of fixed reference latents that represent the "ground truth" of what is being generated — the character's face in the first clear frame, the background in its original state, the object's defined shape.
As generation proceeds, each new frame attends to both the recent temporal context (standard autoregressive conditioning) and these persistent coherence anchors. When the model's forward state begins to drift, the anchor attention pulls it back toward the established reference. The drift does not compound because the error is corrected at every step rather than allowed to accumulate.
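A minimal sketch of how that joint attention could be wired, assuming a transformer-style video diffusion backbone; the class name, tensor names, and shapes below are illustrative assumptions, not details taken from the paper:

```python
# Sketch: each new frame's query attends over both the recent temporal context and a
# small set of fixed "coherence anchor" latents, so drift is corrected at every step.
import torch
import torch.nn as nn

class CoherenceAnchorAttention(nn.Module):  # illustrative name, not from the paper
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_latents: torch.Tensor, anchor_latents: torch.Tensor) -> torch.Tensor:
        # frame_latents:  (batch, recent_frames, dim)  -- standard temporal context
        # anchor_latents: (batch, num_anchors, dim)    -- fixed references from early frames
        context = torch.cat([anchor_latents, frame_latents], dim=1)
        out, _ = self.attn(query=frame_latents, key=context, value=context)
        return out

# Toy usage: 4 recent frame latents, 2 anchors, embedding dimension 64
layer = CoherenceAnchorAttention(dim=64)
frames, anchors = torch.randn(1, 4, 64), torch.randn(1, 2, 64)
print(layer(frames, anchors).shape)  # torch.Size([1, 4, 64])
```

In a real system the anchors would be latents selected from early denoised frames of the generation itself rather than random tensors; the point of the sketch is the attention pattern, in which current frames always see both the recent context and the fixed references.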
The second key contribution is the paper's adaptive anchor update mechanism. Not all anchors should be static. A video in which a character walks from sunlight into shadow should produce a face that changes under different lighting — that is coherent physics, not drift. The update mechanism uses a learned divergence detector to distinguish between intended change (the character moving, expressions shifting, environments transitioning) and unintended drift (the character's face shape gradually morphing due to accumulated error). Anchors are updated when the detected change is intentional and held fixed when the divergence is due to drift.
This distinction — between intentional change and accumulated error — is the genuinely hard problem that previous approaches failed to solve. Earlier temporal consistency methods tended to be too conservative: they suppressed drift but also suppressed legitimate motion and change, producing videos that felt frozen or unnaturally stable. The EPFL adaptive mechanism threads this needle with a learned model rather than a hand-tuned heuristic.
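A sketch of how that gating might look, assuming a small MLP as the learned divergence detector; the detector architecture, the threshold, and the hard keep-or-replace update are simplifying assumptions rather than the paper's exact mechanism:

```python
# Sketch: a learned detector scores whether the gap between the current frame latent and
# an anchor looks like intentional change (update the anchor) or drift (keep it fixed).
import torch
import torch.nn as nn

class DivergenceDetector(nn.Module):  # illustrative stand-in for the learned detector
    def __init__(self, dim: int):
        super().__init__()
        # Scores a (current, anchor) pair; values near 1.0 mean "intentional change".
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, current: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([current, anchor], dim=-1))

def update_anchor(anchor, current, detector, threshold=0.5):
    """Refresh the anchor toward the current latent only when the detector classifies
    the divergence as intentional change; otherwise hold the anchor fixed."""
    score = detector(current, anchor)              # (batch, 1)
    keep = (score < threshold).float()             # drift -> keep the old anchor
    return keep * anchor + (1 - keep) * current

detector = DivergenceDetector(dim=64)
anchor, current = torch.randn(1, 64), torch.randn(1, 64)
print(update_anchor(anchor, current, detector).shape)  # torch.Size([1, 64])
```

The essential design choice the paper argues for is that this decision is learned from data rather than hand-tuned, which is what lets intentional motion and lighting changes pass through while accumulated error is suppressed.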
The paper demonstrates coherent generation up to four minutes at high resolution in benchmark evaluations, with character consistency scores that prior methods could not sustain beyond thirty seconds.
What ICLR 2026 acceptance means for credibility
Not every AI paper with a dramatic claim deserves attention. The volume of preprints published weekly on arXiv is large enough that significant noise is unavoidable. ICLR acceptance is the signal that filters that noise.
ICLR — the International Conference on Learning Representations — is one of the three most selective peer-reviewed venues in machine learning, alongside NeurIPS and ICML. Acceptance rates typically fall in the 25–32% range, with oral presentations reserved for the top 1–2% of submissions. The review process is double-blind and adversarial by design — reviewers are expected to probe for weaknesses, and authors must respond to critiques in an interactive rebuttal phase.
For a video generation paper to be accepted to ICLR 2026, the core claims must be reproducible, the baselines must be fairly chosen and properly implemented, and the evaluation methodology must hold up to scrutiny from researchers who have every incentive to find flaws. The EPFL team's four-minute coherence demonstration was evaluated under these conditions, not just on cherry-picked examples presented in a company blog post.
This matters because the AI video industry has a credibility problem with overstated capability claims. Every commercial product launch from Sora to Veo to Kling involved carefully curated demo clips that did not represent typical output. ICLR acceptance subjects the work to a different standard — one that requires the results to hold up in controlled, reproducible conditions.
The EPFL paper being presented at ICLR 2026 means the approach is real, the improvements over baselines are real, and the technique is implementable by other research groups who want to build on it. That is a meaningfully higher bar than "this clip looks good in a press release."
Minutes-long AI video: what is now possible
The gap between five-second clips and four-minute clips is not linear — it is categorical.
Five-second clips are useful for social media. They can illustrate a concept, animate a product shot, or generate a visual for a post. They are fundamentally inadequate for storytelling, for education, for training videos, for marketing that requires narrative arc, or for any content format that needs a viewer to be engaged across more than a few seconds.
Four-minute coherent AI video changes the category of product that is buildable.
A four-minute video can contain a complete narrative arc. It can introduce a character, show them facing a challenge, and resolve it. It can explain a technical concept with visual examples that build on each other. It can tell a brand story from origin to product to customer outcome. These are the formats that actually move audiences and drive commercial value.
Character consistency across minutes of video means you can now build recurring characters. One of the most commercially valuable things in content creation is a recognizable character — think of how brand mascots work, or how educational YouTube channels are built around a specific presenter. AI video with four-second coherence windows could not produce a recurring character. AI video with four-minute coherence and stable facial features across clips can.
Stable environments across minutes means you can produce multi-scene content. A cooking video with multiple dishes prepared in the same kitchen. A product demonstration with multiple features shown in a consistent setting. A training module that returns to the same interface repeatedly. All of these require environmental consistency that existing tools could not provide.
The shift from "clip generator" to "video content producer" is what EPFL's breakthrough enables. It is not an incremental improvement in a metric — it is the removal of a hard ceiling that defined what the entire category could do.
Commercial implications for Sora, Veo, and Kling
Every major AI video platform has been operating under the same constraint that EPFL just published a solution for. The competitive implications are immediate.
OpenAI's Sora was announced in early 2024 to significant excitement and has been iterating on coherence and resolution since. Its current five-to-ten-second coherent window is the primary reason it has not displaced professional video production workflows. Sora has the model quality and the brand — what it has lacked is duration. The EPFL technique is published research; OpenAI's team will implement and evaluate it on their architecture within weeks of the paper's release. If the approach transfers cleanly to Sora's architecture (and the general technique is architecture-agnostic enough that it should), this is the upgrade Sora needs to become genuinely disruptive to professional production.
Google's Veo is integrated into Google's broader creative toolchain including YouTube's production features. Google DeepMind has strong video research capabilities and ICLR acceptance ensures the paper receives immediate internal attention. Veo's commercial deployment through YouTube Studio means any duration extension has a direct path to hundreds of millions of content creators.
Kling from Kuaishou has been the dark horse of the commercial video market — it launched with notably strong motion quality and has attracted significant enterprise adoption in China. Kuaishou's engineering team is sophisticated enough to implement the EPFL approach quickly. Given that Chinese AI labs have shown a pattern of rapidly integrating published research breakthroughs, Kling-based implementations could be among the first commercial deployments.
Runway's Gen-3 has built a strong professional user base among video editors and creative agencies. Runway's differentiation has been on professional workflow integration rather than raw generation capability. Extended coherence would directly address the primary limitation their professional users cite.
The competitive dynamic now is straightforward: whoever integrates the EPFL breakthrough first and demonstrates credible four-minute coherent generation in production will capture the segment of the market — advertising agencies, training content producers, marketing teams — that has been waiting for duration to reach usability.
Hollywood and professional content creation impact
The film and television industry has been watching AI video with a mixture of anxiety and skepticism. The anxiety comes from the obvious threat to certain production roles. The skepticism comes from experience: the actual outputs of current AI video tools, seen without cherry-picking, are not yet close to broadcast quality.
The drift problem has been a major source of that skepticism. Cinematographers and directors evaluating AI video tools quickly identify temporal incoherence as the disqualifying characteristic. A shot of an actor that morphs slightly across fifteen seconds is not usable. It does not matter how high the per-frame resolution is or how impressive the motion quality looks in the first three seconds. Professional video requires frame-to-frame consistency as a baseline requirement, not an enhancement.
EPFL's breakthrough changes the skeptic's calculation. If temporal coherence is now solvable, the remaining gaps between AI video and professional production are about resolution, fine detail, physics fidelity, and creative control — all of which are tractable engineering problems that are improving with each model generation. Temporal incoherence was uniquely problematic because it was not a matter of more compute or better training data. It was an architectural flaw. Architectural flaws are fixed with architecture changes, which is exactly what the EPFL paper provides.
The practical near-term impact on professional production is likely in pre-visualization and pitching, not in finished content. Directors and production designers use animatics and rough previsualization to communicate what a shot or sequence should look like before committing expensive production resources. AI video that maintains coherence across minutes is viable for pre-viz even at current quality levels. The cost savings on pre-production visualization alone represent a meaningful market.
Further out — in the two-to-five-year range — the more disruptive scenario is AI-assisted production of specific content categories where cinematic quality is not required but professional consistency is: corporate training videos, educational content, marketing explainers, product demonstrations, e-commerce video. These markets are large, cost-sensitive, and have been underserved by tools that could not produce coherent content beyond a few seconds.
Remaining technical challenges
Solving temporal incoherence does not make AI video a solved problem. It removes one major constraint and reveals the constraints behind it.
Resolution at scale. Generating coherent video at high resolution across four minutes requires substantially more compute than generating five-second clips. The current generation of AI video models runs at resolutions that are acceptable for web content but fall short of broadcast standards at longer durations. The coherence breakthrough does not reduce compute requirements — if anything, maintaining coherence anchors across a long generation adds computational overhead.
Physical consistency. Human faces are the most studied and evaluated objects in visual media — the human visual system is exquisitely sensitive to face anomalies. AI video models are strongest on faces for this reason. But physical consistency of non-face objects — the way water moves, the way cloth folds under tension, the way light scatters through fog — remains inconsistent. These are physics simulation problems that the EPFL approach does not address directly.
Camera control. Professional video production relies on precise camera movement: specific focal lengths, choreographed tracking shots, motivated cut angles. Current AI video generation offers limited, imprecise camera control. Temporal coherence across a four-minute clip with inconsistent or uncontrollable camera motion still limits creative utility.
Audio-visual synchronization. Commercial video content almost always includes audio — music, voiceover, dialogue, sound effects. Synchronizing AI-generated video with audio, or generating both together coherently, is a separate problem that the EPFL paper does not address.
Creative direction fidelity. The gap between what a professional director intends and what a text-to-video prompt produces is still large. The EPFL technique maintains consistency within a generation — but the initial generation still requires significant prompt engineering to get the desired output, and iteration is still expensive.
These remaining challenges are significant, but they are qualitatively different from temporal incoherence. They are refinement problems. Temporal incoherence was a structural ceiling. The difference matters for how quickly the remaining gap closes.
The competitive landscape after this paper
The publication of the EPFL paper at ICLR 2026 does more than give commercial players a technique to implement. It resets the research baseline for the entire field.
Before this paper, every video generation research group was working around the drift problem — either accepting it as a constraint or applying partial mitigations that left the underlying architecture unchanged. The EPFL approach is now the documented, peer-reviewed solution. Future research in the field will build on it rather than work around it.
This has a specific implication for academic research groups and the labs that recruit from them: the research agenda for AI video now shifts from "how do we extend coherence to more than a few seconds" to "how do we push coherent generation to feature length, at broadcast resolution, with precise creative control." Those are harder problems, but they are the right problems — the problems whose solutions produce genuinely transformative tools.
For startups building in the AI video space, the paper is a forcing function. The technique is implementable, and the larger labs will implement it quickly. Any startup whose differentiation was "we produce longer coherent clips than Sora" needs to reconsider its positioning. The window in which that was a defensible advantage is closing.
For enterprise customers evaluating AI video platforms, the paper provides a clear question to ask vendors: have you implemented the EPFL temporal anchoring technique, and what are your benchmark results on coherent duration at your production resolution? The answer to that question will quickly separate the vendors who have genuinely solved the problem from those who are still shipping the same constrained architecture with marketing copy adjusted to reference the research.
What to expect in the next 12 months
The publication of a breakthrough paper at ICLR typically has a predictable arc in AI: preprint circulates among researchers, major labs implement internally within weeks, commercial products begin reflecting the improvement within three to six months, and public awareness catches up to reality three to twelve months after that.
For the EPFL temporal incoherence paper, the timeline looks roughly as follows.
In the near term — within 90 days — expect to see the major AI video labs publish blog posts or technical notes referencing the EPFL technique and describing their implementation. Expect third-party researchers to publish replication results confirming or refining the approach. Expect benchmark comparisons from independent evaluators quantifying the improvement on standard temporal consistency metrics.
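As an illustration of the kind of temporal consistency metric such comparisons tend to rely on (a simplified stand-in, not any specific published benchmark), one common pattern is to embed each frame with an image encoder and track how similar later frames stay to the first:

```python
# Simplified consistency score: mean cosine similarity of every frame's embedding to the
# first frame's embedding. Values near 1.0 indicate a stable subject; falling values
# indicate drift. Real evaluations would use a pretrained encoder (e.g. CLIP-style).
import torch
import torch.nn.functional as F

def temporal_consistency(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (num_frames, dim) per-frame embeddings
    reference = frame_features[0:1]                                # (1, dim)
    sims = F.cosine_similarity(frame_features, reference, dim=-1)  # (num_frames,)
    return sims.mean()

# Toy example with random tensors standing in for real frame embeddings
features = torch.randn(120, 512)
print(float(temporal_consistency(features)))
```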
Within six months, expect at least one major commercial AI video platform to demonstrate credible one-to-two-minute coherent generation in a public beta. The most likely candidates are Sora (OpenAI has the engineering depth to move quickly) and Veo (Google DeepMind is a top-tier research institution with direct video product deployment).
Within twelve months, expect multi-minute coherent AI video to be a table-stakes feature of any serious commercial video generation platform. The remaining competitive differentiation will shift entirely to resolution, creative control, audio integration, and pricing — all of which are domains where the competitive landscape is more fragmented and where smaller, specialized players can carve out sustainable positions.
The EPFL breakthrough is not the end of AI video development. It is the moment at which the development agenda shifted from "make the product minimally viable" to "make the product genuinely good." That transition is worth paying attention to.
Frequently asked questions
What exactly is the "drift problem" in AI video?
Temporal drift is the accumulated error that occurs when AI video models generate frames sequentially. Each frame introduces small stochastic errors through the diffusion sampling process. These errors compound over time until the model's internal representation of what it is generating diverges significantly from its original state. The result is visible as characters whose faces gradually morph, environments that shift subtly across a clip, or objects that change shape without any physical reason. It is an architectural problem inherent to how diffusion-based video models work, not a bug that can be patched in software.
How does the EPFL solution actually work?
The EPFL team introduces "coherence anchors" — reference representations sampled from the early frames of a video that are maintained in the model's attention context throughout the entire generation. Each new frame attends to both recent frames (standard temporal conditioning) and these fixed reference representations, preventing drift from compounding. A learned adaptive mechanism distinguishes between intentional change (a character moving or lighting changing) and unintended drift, updating anchors for the former and holding them fixed for the latter.
Why is ICLR 2026 acceptance significant?
ICLR is one of the three most selective peer-reviewed venues in machine learning, with acceptance rates around 25–32% and rigorous double-blind review. Acceptance means the core claims have been scrutinized and validated by independent expert reviewers, the experimental methodology is sound, and the improvement over prior baselines is genuine and measurable — not cherry-picked. This is a higher credibility bar than a company blog post or preprint announcement.
Does this mean AI-generated feature films are now possible?
Not immediately. The EPFL breakthrough solves temporal incoherence, but significant challenges remain: resolution at broadcast quality, physical consistency of non-face objects, precise camera control, audio-visual synchronization, and creative direction fidelity. The breakthrough removes the single most fundamental architectural barrier, but the remaining gaps are real. Feature-length AI-generated film is now a tractable long-term goal rather than an impossible one, but the realistic near-term impact is in pre-visualization, corporate video, educational content, and marketing — not theatrical release.
How quickly will commercial tools like Sora, Veo, and Kling integrate this?
Major AI labs move quickly on published research that addresses a known bottleneck in their products. Expect internal implementation and evaluation within weeks of the paper's publication, and commercial product updates reflecting the improvement within three to six months. The technique is described in enough detail in the paper to be implementable by any team with strong video model engineering capabilities, which all the major players have.
Will this eliminate jobs in video production?
The breakthrough makes AI video significantly more capable, but capability gaps remain that preserve professional production value. The near-term impact is most significant in pre-visualization, where AI video now becomes viable for communicating directorial intent before production. The medium-term impact affects cost-sensitive content categories: corporate training, marketing explainers, educational content, e-commerce. High-end cinematic production for theatrical or prestige television remains differentiated by creative direction fidelity, actor performance capture, and production design — areas where AI video is still a tool rather than a replacement.
Can I use this technique myself?
The paper is published and will be available through ICLR's open-access proceedings. Implementing it requires a trained video diffusion model and the engineering capacity to modify the attention architecture and add the anchor maintenance mechanism. Academic researchers and well-resourced engineering teams can implement it from the paper. For most end users, the technique will become accessible when commercial platforms integrate it into their products.