Tencent's Penguin-VL Ditches CLIP and Beats Every Rival VLM Under 10B Parameters
Tencent AI Lab's Penguin-VL replaces CLIP vision encoders with LLM-initialized encoders, setting new SOTA on doc understanding and video benchmarks at 2B and 8B scale.
Tencent AI Lab just released Penguin-VL, a pair of vision-language models (2B and 8B parameters) that replace the standard CLIP-based vision encoder — the component responsible for "seeing" — with an encoder initialized from a text-only language model. The result is a VLM that tops the charts on document understanding, OCR, and video comprehension while using the same compute budget as smaller, weaker competitors. The models, encoder weights, and code are all open-source, and the paper landed as the #1 trending paper on Hugging Face on March 9, 2026.
To understand why Penguin-VL is interesting, you first need to understand a dirty secret about how most vision-language models (VLMs) are built today.
Models like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5, Qwen3-VL, and InternVL all follow roughly the same recipe: a contrastively pre-trained vision encoder (typically CLIP or a successor like SigLIP), a projector that maps visual features into the language model's embedding space, and a large language model backbone that does the reasoning.
The vision encoder's job is to look at an image and convert it into a sequence of numerical representations — essentially a summary of what's in the picture — that the language model can then reason about.
CLIP and SigLIP are trained through a process called contrastive learning: the model learns by comparing image-text pairs and figuring out which image goes with which caption. This is enormously powerful for understanding what's in a scene at a conceptual level — is this a cat or a dog? Is this a beach or a forest?
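To make that objective concrete, here is a minimal pure-Python sketch of the symmetric contrastive (InfoNCE-style) loss that CLIP-family models optimize. It is a simplified illustration, not CLIP's actual implementation: each image is rewarded only for being closer to its own caption than to the other captions in the batch.

```python
import math

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_embs, text_embs: equal-length lists of vectors; pair i is a match.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine-similarity logits, scaled by temperature.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]

    def cross_entropy(row, target):
        m = max(row)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Image-to-text direction: row i should pick caption i.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should pick image j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Note what the objective does not reward: any detail that is unnecessary for telling pair i apart from the other pairs in the batch contributes nothing to the loss, which is the root of the discrimination bias discussed below.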
But contrastive learning has a fundamental limitation that the Penguin-VL team identified: it optimizes for discrimination, not description.
When CLIP learns to match "a photo of a dog" with an image of a dog, it learns to suppress everything that's not relevant to that distinction. Fine-grained spatial details, tiny text, subtle geometric relationships, the precise position of a data point on a chart — all of this gets smoothed away in favor of high-level categorical features.
This is why, if you've ever noticed that frontier VLMs struggle with questions like "what is the exact value shown in the third bar of this chart?" or "read the small print at the bottom of this contract" — that's largely a CLIP problem. The vision encoder was simply never trained to preserve that kind of fine-grained detail.
For years, the field's answer to this was: make the model bigger, use more training data, or add specialized post-processing layers. The Penguin-VL team took a different approach entirely.
The central innovation of Penguin-VL is deceptively simple: don't use a contrastively trained vision encoder at all.
Instead, initialize the vision encoder from a small text-only language model — specifically, from Qwen3-0.6B. Then adapt this text model to process images rather than tokens.
Why would a language model make a better vision encoder than something specifically designed for vision?
The key insight is that large language models are trained to preserve and process fine-grained information. When you predict the next token in "The interest rate shown in Figure 3 is 4.7%", you need to remember precisely what was in Figure 3, including the exact number. Language modeling as a pre-training objective pushes representations to be informationally complete rather than categorically discriminative.
The team's hypothesis: a vision encoder initialized from an LLM and then trained on vision tasks will naturally preserve the kinds of fine-grained spatial and textual cues that contrastive pre-training actively discards. The empirical results they present suggest this hypothesis holds up well.
The vision encoder, which the team calls Penguin-Encoder, starts as Qwen3-0.6B (a 600M-parameter text-only model) and is adapted through several architectural modifications so that it processes images rather than text tokens.
After encoding, a lightweight MLP projector maps the encoder's output into the language backbone's embedding space.
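One plausible way to picture this adaptation, sketched below with hypothetical shapes and without claiming it matches the paper's exact modifications: the LLM's token-embedding lookup is replaced by a projection over image patches, while the transformer blocks themselves are reused from the text model.

```python
# Minimal sketch (all shapes hypothetical): reusing an LLM's transformer stack
# as a vision encoder by swapping token embeddings for a patch projection.

def patchify(image, patch=2):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes image dimensions are divisible by the patch size.
    """
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch) for dc in range(patch)])
    return patches

def linear(vec, weights, bias):
    """Dense projection: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def encode_image(image, proj_w, proj_b, llm_blocks):
    """Project patches into the LLM's embedding space, then run the
    transformer blocks unchanged (they came from the text model)."""
    hidden = [linear(p, proj_w, proj_b) for p in patchify(image)]
    for block in llm_blocks:   # same block structure as the text model
        hidden = block(hidden)
    return hidden              # one embedding per image patch
```

Everything downstream of the input projection is the same machinery that was trained to preserve fine-grained information for next-token prediction, which is the representational prior the team is betting on.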
The encoder itself is released separately as Penguin-Encoder on Hugging Face, allowing researchers to swap it into other VLM architectures.
The team follows a standard multi-stage training recipe but with an important distinction: because the encoder starts from an LLM rather than a CLIP model, the alignment phase between vision encoder and language backbone is approached differently. The encoder doesn't need to be retrained to produce "language-like" representations — it already produces them, having been initialized from one.
The proof is in the benchmarks. Tencent tested both models against the strongest competitors in their respective size classes.
Chart / OCR / Document Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| InfoVQA | 77.8 | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | 86.6 | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | 94.1 | 93.3 | 89.4 | 78.4 | 80.0 |
| OCRBench | 858 | 810 | 836 | 700 | 729 |
General Reasoning
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| AI2D | 80.7 | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | 70.2 | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | 83.8 | 74.9 | 69.1 | 46.0 | 51.8 |
| MathVista | 67.3 | 61.3 | 60.8 | 50.4 | 51.5 |
Video Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| LongVideoBench | 59.5 | 52.1 | 57.4 | 43.0 | 49.7 |
| CharadesSTA | 56.2 | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | 79.9 | 76.9 | 76.1 | 65.4 | 62.4 |
| Perception Test | 70.4 | 64.5 | 64.7 | 48.6 | 51.6 |
The 2B model leads on 14 out of 17 benchmarks tested against Qwen3-VL 2B (the previous leader at this scale), InternVL3.5 2B, Gemma3n E2B, and SmolVLM2 2.2B. The margins are largest on document understanding tasks — exactly where the CLIP limitation hypothesis predicts they should be.
Chart / OCR / Document Understanding
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| InfoVQA | 86.8 | 83.1 | 79.1 | 49.2 |
| ChartQA | 90.5 | 89.6 | 86.7 | 48.6 |
| DocVQA | 96.2 | 96.1 | 92.3 | 78.3 |
General Reasoning
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| AI2D | 86.1 | 85.7 | 84.0 | 65.7 |
| RealWorldQA | 75.8 | 71.5 | 67.5 | 60.7 |
| MathVista | 77.4 | 77.2 | 74.2 | 40.9 |
Video Understanding
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| LongVideoBench | 67.0 | 62.6 | 62.1 | 38.1 |
| CharadesSTA | 61.4 | 56.0 | 32.8 | 5.0 |
| NextQA | 85.4 | 82.3 | 81.3 | 59.3 |
| Perception Test | 78.0 | 72.7 | 72.7 | – |
The 8B model is similarly dominant. Notably, GPT-5 nano — OpenAI's lightweight commercial model — significantly underperforms both Penguin-VL variants across all tested benchmarks. The gap is particularly stark on video understanding tasks, where GPT-5 nano scores 5.0 on CharadesSTA versus Penguin-VL-8B's 61.4.
The results aren't a clean sweep. Qwen3-VL 8B still leads on MMMU-Pro (55.9 vs 40.2 for Penguin-VL-8B) — a graduate-level academic reasoning benchmark — and on CharXiv reasoning questions (46.4 vs 40.0). These gaps likely reflect differences in training data and language model quality for abstract multi-step reasoning rather than anything fundamental about the vision encoder approach.
The work comes from Tencent AI Lab, the research division of Tencent, the company behind WeChat, one of the world's most-used messaging platforms. Tencent has been steadily building out its VLM research capabilities, and Penguin-VL represents one of the lab's most significant open-source releases.
The team chose to open-source everything: both VLM variants, the standalone Penguin-Encoder, and the training code. This is significant because the Penguin-Encoder alone — a vision encoder that can be dropped into any VLM architecture — may prove to be the paper's most lasting contribution.
The fact that the encoder weights are available separately means researchers can run ablation studies comparing CLIP vs. LLM-initialized encoders in other VLM architectures, helping establish whether the performance gains generalize beyond the Penguin-VL training recipe.
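The modularity argument can be sketched in a few lines. This is an illustrative toy (all components are hypothetical stand-ins, not any real API): as long as an encoder exposes the same interface, a CLIP-style encoder and an LLM-initialized one can be ablated against each other inside an otherwise identical VLM.

```python
# Illustrative sketch of a VLM with a swappable vision encoder. Every
# component here is a toy stand-in, not a real model or library call.

class VLM:
    def __init__(self, encoder, projector, backbone):
        self.encoder = encoder      # any callable: image -> patch features
        self.projector = projector  # maps encoder dim -> backbone dim
        self.backbone = backbone    # language model over mixed token sequence

    def answer(self, image, prompt_tokens):
        vision_tokens = [self.projector(f) for f in self.encoder(image)]
        return self.backbone(vision_tokens + prompt_tokens)

# Two interchangeable encoders with the same interface:
clip_style = lambda img: [[1.0, 0.0]]   # stand-in for a CLIP-style encoder
llm_init   = lambda img: [[0.0, 1.0]]   # stand-in for an LLM-initialized one
project    = lambda f: [sum(f)]         # trivial 2-dim -> 1-dim projector
backbone   = lambda toks: len(toks)     # stand-in LM: counts input tokens

for enc in (clip_style, llm_init):
    model = VLM(enc, project, backbone)
    # Same pipeline, same projector, same backbone; only the encoder changes.
    out = model.answer(image=None, prompt_tokens=[[0.5], [0.5]])
```

A clean ablation holds everything fixed except the encoder slot, which is exactly what a separately released Penguin-Encoder makes possible.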
The benchmark gains are concentrated in exactly the areas that matter most for real-world enterprise deployment:
A DocVQA score of 94.1% at 2B parameters means Penguin-VL-2B can answer questions about the content of scanned documents — contracts, invoices, forms, reports — at near-human accuracy while running on hardware that costs a fraction of what frontier models require. For companies processing thousands of documents a day, this is a meaningful cost reduction.
ChartQA scores of 86.6% (2B) and 90.5% (8B) mean the model can reliably read and interpret data visualizations. This is valuable for financial analysis, scientific paper processing, business intelligence dashboards, and any context where humans or systems need to extract quantitative information from visual charts.
The strong video benchmarks — particularly CharadesSTA (moment retrieval in videos) and LongVideoBench (long-form video QA) — suggest Penguin-VL is well-suited for video surveillance, educational content processing, meeting transcription and understanding, and any application requiring temporal reasoning over video.
The 2B model variant is small enough to run on high-end smartphones and edge devices. For applications like real-time document scanning, accessibility tools for visually impaired users, or offline enterprise deployments in regulated industries, a 2B model that beats previous 7B-scale models is a significant capability unlock.
Despite the impressive results, several questions remain open:
Training data transparency: The paper does not fully disclose what data was used to train the Penguin-Encoder or to fine-tune the VLMs. This makes it difficult to isolate whether the performance gains come from the architectural innovation or from a superior data pipeline.
Abstract reasoning gap: As noted in the benchmarks, Penguin-VL-8B lags Qwen3-VL-8B on MMMU-Pro by 15+ points. Abstract, multi-step academic reasoning appears to be a weakness. This may improve with scale, but the gap is notable.
Long-context vision: The paper doesn't extensively evaluate on tasks requiring simultaneous reasoning over many images or very long image sequences. Some enterprise applications (multi-page document processing, multi-camera video) may expose new limitations.
Reproducibility: While code and weights are open-sourced, reproducing the full training pipeline at the reported quality likely requires significant compute infrastructure and engineering effort beyond what's accessible to most researchers.
The benchmark overfitting question: Any model that specifically optimizes for known benchmarks will show inflated scores. Without independent third-party evaluation on held-out tasks, some caution about the absolute magnitude of the performance claims is warranted.
The implications of Penguin-VL extend beyond the specific model release.
CLIP has functioned as the de facto standard vision encoder for VLMs since its 2021 release. The VLM field has iterated on language backbone scaling, training data curation, and alignment techniques — but the vision encoder has largely remained CLIP or one of its successors (SigLIP, CLIP-ViT-L, etc.).
If the Penguin-VL results hold up to independent scrutiny, they represent the first credible challenge to that monoculture. A vision encoder initialized from an LLM is a genuinely different architectural bet, and the benchmark evidence — especially on document and video tasks — is strong enough that other labs will likely run their own experiments.
This matters for AI research more broadly: if the entire field has been leaving performance on the table by defaulting to CLIP, there's significant headroom to recover simply by changing the encoder initialization strategy.
Penguin-VL-8B outperforms GPT-5 nano — a closed commercial model — on most tested benchmarks. This continues the trend of open-source models closing the gap with proprietary ones. For enterprises considering deployment options, a freely available model that beats a paid API on key tasks is a compelling argument for self-hosting.
By releasing Penguin-Encoder separately from the full VLM, Tencent is making a bet that vision encoders should be modular, swappable components rather than baked into end-to-end systems. This mirrors how the NLP field eventually standardized on plug-and-play transformer encoders. If Penguin-Encoder gets adopted in third-party VLMs and consistently outperforms CLIP, the architectural shift could happen faster than any single model release would suggest.
The ability to reliably read charts, tables, and figures — which Penguin-VL dramatically improves — has direct implications for AI-assisted scientific research. Scientific literature is dense with data visualizations. A VLM that can accurately parse them opens up new possibilities for automated literature synthesis, hypothesis generation from published results, and cross-paper reasoning.
This isn't far off: a research assistant that can read through 500 papers and accurately extract the benchmark tables, figure out what each study actually found, and identify inconsistencies between claims — that's enormously valuable, and it requires exactly the fine-grained visual understanding that Penguin-VL improves on.
Q: Why is CLIP's "discrimination bias" actually a problem in practice?
CLIP learns to match images to text descriptions by contrasting correct pairs against incorrect ones. This trains it to be very good at recognizing that "this image contains a cat" and "this image does not contain a dog" — but it suppresses variation within categories. Two images of different financial charts will be encoded similarly if they both "look like a chart," even if the numbers are completely different. For tasks that require extracting those numbers, CLIP-based encoders are fundamentally limited.
Q: Does this mean CLIP is useless?
Not at all. CLIP remains excellent for tasks like image retrieval, zero-shot image classification, semantic image search, and any application where you need to match images and text at a conceptual level. The Penguin-VL results suggest it's suboptimal specifically for fine-grained perception tasks within VLMs — a different use case from what CLIP was originally designed for.
Q: Why initialize the vision encoder from a text model? Doesn't it know nothing about images?
That's exactly the point. The initialization provides a representational infrastructure (attention patterns, layer normalization, residual connections, and crucially, the objective of preserving fine-grained information) that happens to be better suited to detailed visual understanding than CLIP's discriminative objective. The encoder is then trained on visual data, but it starts from a different representational prior.
Q: Can I run Penguin-VL-2B locally?
The 2B model requires approximately 4-6 GB of VRAM in BF16 precision, making it runnable on a modern gaming GPU (RTX 3080 or better). The 8B model needs 18-20 GB of VRAM, so it requires a high-end consumer card or a professional GPU. Both models are available on Hugging Face as standard transformer models compatible with the transformers library.
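Those figures are easy to sanity-check with back-of-envelope arithmetic. The sketch below estimates the weights alone; activations, the KV cache, and framework overhead come on top, which is why the ranges quoted above are higher.

```python
# Rough VRAM estimate for model weights only (BF16 = 2 bytes per parameter).
# Activations, KV cache, and framework overhead add several GB on top.

def weight_vram_gb(params_billions, bytes_per_param=2):
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

print(round(weight_vram_gb(2), 1))  # ~3.7 GB of weights for the 2B model
print(round(weight_vram_gb(8), 1))  # ~14.9 GB of weights for the 8B model
```

Quantizing to INT8 or 4-bit (passing `bytes_per_param=1` or `0.5`) roughly halves or quarters the weight footprint, at some cost in accuracy.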
Q: How does this compare to work on better visual tokenization (like from AI2 or others)?
Penguin-VL takes a different approach from work on visual tokenization (converting images into discrete tokens). While visual tokenizers try to compress image information more efficiently, Penguin-VL focuses on preserving more information in the encoder's continuous representations. These approaches are not mutually exclusive — it's plausible a future model could combine an LLM-initialized encoder with improved visual tokenization for further gains. See also AI2's OLMo data efficiency work for related thinking on architectural efficiency.
Penguin-VL's core contribution is an architectural insight that the field will need to engage with seriously: the standard CLIP vision encoder that every VLM inherits may be the wrong foundation for fine-grained visual understanding tasks.
The benchmark evidence is strong. Penguin-VL-2B leads on 14 of 17 tested benchmarks against the best models in its size class. Penguin-VL-8B consistently outperforms InternVL3.5-8B and GPT-5 nano on document understanding and video tasks. These aren't marginal improvements — on CharadesSTA (video temporal grounding), Penguin-VL-8B scores 61.4 vs InternVL3.5's 32.8.
There are caveats: abstract reasoning remains a weakness, training data provenance is not fully disclosed, and independent third-party evaluation hasn't happened yet. But the direction is clear enough that other major labs — including the ones that have relied on CLIP for years — will be running their own experiments now.
The Penguin-Encoder's separate release is the most strategically interesting element. If researchers find it improves results when dropped into other architectures, the case for an LLM-initialized vision encoder as the new default becomes substantially stronger. That would represent a meaningful architectural shift in a field that has taken CLIP's centrality largely for granted.
Watch this space. The CLIP era in VLMs may be ending.