Tencent's Penguin-VL Ditches CLIP and Beats Every Rival VLM Under 10B Parameters
Tencent AI Lab's Penguin-VL replaces CLIP vision encoders with LLM-initialized encoders, setting new SOTA on doc understanding and video benchmarks at 2B and 8B scale.
Tencent AI Lab just released Penguin-VL, a pair of vision-language models (2B and 8B parameters) that replace the standard CLIP-based vision encoder — the component responsible for "seeing" — with an encoder initialized from a text-only language model. The result is a VLM that tops the charts on document understanding, OCR, and video comprehension while using the same compute budget as smaller, weaker competitors. The models, encoder weights, and code are all open-source, and the paper landed as the #1 trending paper on Hugging Face on March 9, 2026.
To understand why Penguin-VL is interesting, you first need to understand a dirty secret about how most vision-language models (VLMs) are built today.
Models like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5, Qwen3-VL, and InternVL all follow roughly the same recipe: a contrastively pre-trained vision encoder (typically CLIP or a successor like SigLIP), a projector that maps visual features into the language model's embedding space, and a large language model backbone that does the reasoning.
The vision encoder's job is to look at an image and convert it into a sequence of numerical representations — essentially a summary of what's in the picture — that the language model can then reason about.
CLIP and SigLIP are trained through a process called contrastive learning: the model learns by comparing image-text pairs and figuring out which image goes with which caption. This is enormously powerful for understanding what's in a scene at a conceptual level — is this a cat or a dog? Is this a beach or a forest?
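To make that objective concrete, here is a minimal pure-Python sketch of the symmetric contrastive (InfoNCE-style) loss that CLIP-family models optimize. It is a simplified illustration, not CLIP's actual implementation: each image is rewarded only for being closer to its own caption than to the other captions in the batch.

```python
import math

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_embs, text_embs: equal-length lists of vectors; pair i is a match.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine-similarity logits, scaled by temperature.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]

    def cross_entropy(row, target):
        m = max(row)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    # Image-to-text direction: row i should pick caption i.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should pick image j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Note what the objective does not reward: any detail that is unnecessary for telling pair i apart from the other pairs in the batch contributes nothing to the loss, which is the root of the discrimination bias discussed below.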
But contrastive learning has a fundamental limitation that the Penguin-VL team identified: it optimizes for discrimination, not description.
When CLIP learns to match "a photo of a dog" with an image of a dog, it learns to suppress everything that's not relevant to that distinction. Fine-grained spatial details, tiny text, subtle geometric relationships, the precise position of a data point on a chart — all of this gets smoothed away in favor of high-level categorical features.
This is why, if you've ever noticed that frontier VLMs struggle with questions like "what is the exact value shown in the third bar of this chart?" or "read the small print at the bottom of this contract" — that's largely a CLIP problem. The vision encoder was simply never trained to preserve that kind of fine-grained detail.
For years, the field's answer to this was: make the model bigger, use more training data, or add specialized post-processing layers. The Penguin-VL team took a different approach entirely.
The central innovation of Penguin-VL is deceptively simple: don't use a contrastively trained vision encoder at all.
Instead, initialize the vision encoder from a small text-only language model — specifically, from Qwen3-0.6B. Then adapt this text model to process images rather than tokens.
Why would a language model make a better vision encoder than something specifically designed for vision?
The key insight is that large language models are trained to preserve and process fine-grained information. When you predict the next token in "The interest rate shown in Figure 3 is 4.7%", you need to remember precisely what was in Figure 3, including the exact number. Language modeling as a pre-training objective pushes representations to be informationally complete rather than categorically discriminative.
The team's hypothesis: a vision encoder initialized from an LLM and then trained on vision tasks will naturally preserve the kinds of fine-grained spatial and textual cues that contrastive pre-training actively discards. The empirical results they present suggest this hypothesis holds up well.
The vision encoder, which the team calls Penguin-Encoder, starts as Qwen3-0.6B (a 600M-parameter text-only model) and is adapted through several architectural modifications so that it processes images rather than text tokens.
After encoding, a lightweight MLP projector maps the encoder's output into the language backbone's embedding space.
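One plausible way to picture this adaptation, sketched below with hypothetical shapes and without claiming it matches the paper's exact modifications: the LLM's token-embedding lookup is replaced by a projection over image patches, while the transformer blocks themselves are reused from the text model.

```python
# Minimal sketch (all shapes hypothetical): reusing an LLM's transformer stack
# as a vision encoder by swapping token embeddings for a patch projection.

def patchify(image, patch=2):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes image dimensions are divisible by the patch size.
    """
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch) for dc in range(patch)])
    return patches

def linear(vec, weights, bias):
    """Dense projection: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def encode_image(image, proj_w, proj_b, llm_blocks):
    """Project patches into the LLM's embedding space, then run the
    transformer blocks unchanged (they came from the text model)."""
    hidden = [linear(p, proj_w, proj_b) for p in patchify(image)]
    for block in llm_blocks:   # same block structure as the text model
        hidden = block(hidden)
    return hidden              # one embedding per image patch
```

Everything downstream of the input projection is the same machinery that was trained to preserve fine-grained information for next-token prediction, which is the representational prior the team is betting on.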
The encoder itself is released separately as Penguin-Encoder on Hugging Face, allowing researchers to swap it into other VLM architectures.
The team follows a standard multi-stage training recipe but with an important distinction: because the encoder starts from an LLM rather than a CLIP model, the alignment phase between vision encoder and language backbone is approached differently. The encoder doesn't need to be retrained to produce "language-like" representations — it already produces them, having been initialized from one.
The proof is in the benchmarks. Tencent tested both models against the strongest competitors in their respective size classes.
Chart / OCR / Document Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| InfoVQA | 77.8 | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | 86.6 | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | 94.1 | 93.3 | 89.4 | 78.4 | 80.0 |
| OCRBench | 858 | 810 | 836 | 700 | 729 |
General Reasoning
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| AI2D | 80.7 | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | 70.2 | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | 83.8 | 74.9 | 69.1 | 46.0 | 51.8 |
| MathVista | 67.3 | 61.3 | 60.8 | 50.4 | 51.5 |
Video Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| LongVideoBench | 59.5 | 52.1 | 57.4 | 43.0 | 49.7 |
| CharadesSTA | 56.2 | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | 79.9 | 76.9 | 76.1 | 65.4 | 62.4 |
| Perception Test | 70.4 | 64.5 | 64.7 | 48.6 | 51.6 |
The 2B model leads on 14 out of 17 benchmarks tested against Qwen3-VL 2B (the previous leader at this scale), InternVL3.5 2B, Gemma3n E2B, and SmolVLM2 2.2B. The margins are largest on document understanding tasks — exactly where the CLIP limitation hypothesis predicts they should be.
Chart / OCR / Document Understanding
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| InfoVQA | 86.8 | 83.1 | 79.1 | 49.2 |
| ChartQA | 90.5 | 89.6 | 86.7 | 48.6 |
| DocVQA | 96.2 | 96.1 | 92.3 | 78.3 |
General Reasoning
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| AI2D | 86.1 | 85.7 | 84.0 | 65.7 |
| RealWorldQA | 75.8 | 71.5 | 67.5 | 60.7 |
| MathVista | 77.4 | 77.2 | 74.2 | 40.9 |
Video Understanding
| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | GPT-5 nano |
|---|---|---|---|---|
| LongVideoBench | 67.0 | 62.6 | 62.1 | 38.1 |
| CharadesSTA | 61.4 | 56.0 | 32.8 | 5.0 |
| NextQA | 85.4 | 82.3 | 81.3 | 59.3 |
| Perception Test | 78.0 | 72.7 | 72.7 | – |
The 8B model is similarly dominant. Notably, GPT-5 nano — OpenAI's lightweight commercial model — significantly underperforms both Penguin-VL variants across all tested benchmarks. The gap is particularly stark on video understanding tasks, where GPT-5 nano scores 5.0 on CharadesSTA versus Penguin-VL-8B's 61.4.
The results aren't a clean sweep. Qwen3-VL 8B still leads on MMMU-Pro (55.9 vs 40.2 for Penguin-VL-8B) — a graduate-level academic reasoning benchmark — and on CharXiv reasoning questions (46.4 vs 40.0). These gaps likely reflect differences in training data and language model quality for abstract multi-step reasoning rather than anything fundamental about the vision encoder approach.
The work comes from Tencent AI Lab, the research division of Tencent, the company behind WeChat, one of the world's most-used messaging platforms. Tencent has been steadily building out its VLM research capabilities, and Penguin-VL represents one of the lab's most significant open-source releases.
The team chose to open-source everything: both VLM variants, the standalone Penguin-Encoder, and the training code. This is significant because the Penguin-Encoder alone — a vision encoder that can be dropped into any VLM architecture — may prove to be the paper's most lasting contribution.
The fact that the encoder weights are available separately means researchers can run ablation studies comparing CLIP vs. LLM-initialized encoders in other VLM architectures, helping establish whether the performance gains generalize beyond the Penguin-VL training recipe.
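The modularity argument can be sketched in a few lines. This is an illustrative toy (all components are hypothetical stand-ins, not any real API): as long as an encoder exposes the same interface, a CLIP-style encoder and an LLM-initialized one can be ablated against each other inside an otherwise identical VLM.

```python
# Illustrative sketch of a VLM with a swappable vision encoder. Every
# component here is a toy stand-in, not a real model or library call.

class VLM:
    def __init__(self, encoder, projector, backbone):
        self.encoder = encoder      # any callable: image -> patch features
        self.projector = projector  # maps encoder dim -> backbone dim
        self.backbone = backbone    # language model over mixed token sequence

    def answer(self, image, prompt_tokens):
        vision_tokens = [self.projector(f) for f in self.encoder(image)]
        return self.backbone(vision_tokens + prompt_tokens)

# Two interchangeable encoders with the same interface:
clip_style = lambda img: [[1.0, 0.0]]   # stand-in for a CLIP-style encoder
llm_init   = lambda img: [[0.0, 1.0]]   # stand-in for an LLM-initialized one
project    = lambda f: [sum(f)]         # trivial 2-dim -> 1-dim projector
backbone   = lambda toks: len(toks)     # stand-in LM: counts input tokens

for enc in (clip_style, llm_init):
    model = VLM(enc, project, backbone)
    # Same pipeline, same projector, same backbone; only the encoder changes.
    out = model.answer(image=None, prompt_tokens=[[0.5], [0.5]])
```

A clean ablation holds everything fixed except the encoder slot, which is exactly what a separately released Penguin-Encoder makes possible.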
The benchmark gains are concentrated in exactly the areas that matter most for real-world enterprise deployment:
A DocVQA score of 94.1% at 2B parameters means Penguin-VL-2B can answer questions about the content of scanned documents — contracts, invoices, forms, reports — at near-human accuracy while running on hardware that costs a fraction of what frontier models require. For companies processing thousands of documents a day, this is a meaningful cost reduction.
ChartQA scores of 86.6% (2B) and 90.5% (8B) mean the model can reliably read and interpret data visualizations. This is valuable for financial analysis, scientific paper processing, business intelligence dashboards, and any context where humans or systems need to extract quantitative information from visual charts.
The strong video benchmarks — particularly CharadesSTA (moment retrieval in videos) and LongVideoBench (long-form video QA) — suggest Penguin-VL is well-suited for video surveillance, educational content processing, meeting transcription and understanding, and any application requiring temporal reasoning over video.
The 2B model variant is small enough to run on high-end smartphones and edge devices. For applications like real-time document scanning, accessibility tools for visually impaired users, or offline enterprise deployments in regulated industries, a 2B model that beats previous 7B-scale models is a significant capability unlock.
Despite the impressive results, several questions remain open:
Training data transparency: The paper does not fully disclose what data was used to train the Penguin-Encoder or to fine-tune the VLMs. This makes it difficult to isolate whether the performance gains come from the architectural innovation or from a superior data pipeline.
Abstract reasoning gap: As noted in the benchmarks, Penguin-VL-8B lags Qwen3-VL-8B on MMMU-Pro by 15+ points. Abstract, multi-step academic reasoning appears to be a weakness. This may improve with scale, but the gap is notable.
Long-context vision: The paper doesn't extensively evaluate on tasks requiring simultaneous reasoning over many images or very long image sequences. Some enterprise applications (multi-page document processing, multi-camera video) may expose new limitations.
Reproducibility: While code and weights are open-sourced, reproducing the full training pipeline at the reported quality likely requires significant compute infrastructure and engineering effort beyond what's accessible to most researchers.
The benchmark overfitting question: Any model that specifically optimizes for known benchmarks will show inflated scores. Without independent third-party evaluation on held-out tasks, some caution about the absolute magnitude of the performance claims is warranted.
The implications of Penguin-VL extend beyond the specific model release.
CLIP has functioned as the de facto standard vision encoder for VLMs since its 2021 release. The VLM field has iterated on language backbone scaling, training data curation, and alignment techniques — but the vision encoder has largely remained CLIP or one of its successors (SigLIP, CLIP-ViT-L, etc.).
If the Penguin-VL results hold up to independent scrutiny, they represent the first credible challenge to that monoculture. A vision encoder initialized from an LLM is a genuinely different architectural bet, and the benchmark evidence — especially on document and video tasks — is strong enough that other labs will likely run their own experiments.
This matters for AI research more broadly: if the entire field has been leaving performance on the table by defaulting to CLIP, there's significant headroom to recover simply by changing the encoder initialization strategy.
Penguin-VL-8B outperforms GPT-5 nano — a closed commercial model — on most tested benchmarks. This continues the trend of open-source models closing the gap with proprietary ones. For enterprises considering deployment options, a freely available model that beats a paid API on key tasks is a compelling argument for self-hosting.
By releasing Penguin-Encoder separately from the full VLM, Tencent is making a bet that vision encoders should be modular, swappable components rather than baked into end-to-end systems. This mirrors how the NLP field eventually standardized on plug-and-play transformer encoders. If Penguin-Encoder gets adopted in third-party VLMs and consistently outperforms CLIP, the architectural shift could happen faster than any single model release would suggest.
The ability to reliably read charts, tables, and figures — which Penguin-VL dramatically improves — has direct implications for AI-assisted scientific research. Scientific literature is dense with data visualizations. A VLM that can accurately parse them opens up new possibilities for automated literature synthesis, hypothesis generation from published results, and cross-paper reasoning.
This isn't far off: a research assistant that can read through 500 papers and accurately extract the benchmark tables, figure out what each study actually found, and identify inconsistencies between claims — that's enormously valuable, and it requires exactly the fine-grained visual understanding that Penguin-VL improves on.
Q: Why is CLIP's "discrimination bias" actually a problem in practice?
CLIP learns to match images to text descriptions by contrasting correct pairs against incorrect ones. This trains it to be very good at recognizing that "this image contains a cat" and "this image does not contain a dog" — but it suppresses variation within categories. Two images of different financial charts will be encoded similarly if they both "look like a chart," even if the numbers are completely different. For tasks that require extracting those numbers, CLIP-based encoders are fundamentally limited.
Q: Does this mean CLIP is useless?
Not at all. CLIP remains excellent for tasks like image retrieval, zero-shot image classification, semantic image search, and any application where you need to match images and text at a conceptual level. The Penguin-VL results suggest it's suboptimal specifically for fine-grained perception tasks within VLMs — a different use case from what CLIP was originally designed for.
Q: Why initialize the vision encoder from a text model? Doesn't it know nothing about images?
That's exactly the point. The initialization provides a representational infrastructure (attention patterns, layer normalization, residual connections, and crucially, the objective of preserving fine-grained information) that happens to be better suited to detailed visual understanding than CLIP's discriminative objective. The encoder is then trained on visual data, but it starts from a different representational prior.
Q: Can I run Penguin-VL-2B locally?
The 2B model requires approximately 4-6 GB of VRAM in BF16 precision, making it runnable on a modern gaming GPU (RTX 3080 or better). The 8B model needs 18-20 GB of VRAM, so it requires a high-end consumer card or a professional GPU. Both models are available on Hugging Face as standard transformer models compatible with the transformers library.
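Those figures are easy to sanity-check with back-of-envelope arithmetic. The sketch below estimates the weights alone; activations, the KV cache, and framework overhead come on top, which is why the ranges quoted above are higher.

```python
# Rough VRAM estimate for model weights only (BF16 = 2 bytes per parameter).
# Activations, KV cache, and framework overhead add several GB on top.

def weight_vram_gb(params_billions, bytes_per_param=2):
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

print(round(weight_vram_gb(2), 1))  # ~3.7 GB of weights for the 2B model
print(round(weight_vram_gb(8), 1))  # ~14.9 GB of weights for the 8B model
```

Quantizing to INT8 or 4-bit (passing `bytes_per_param=1` or `0.5`) roughly halves or quarters the weight footprint, at some cost in accuracy.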
Q: How does this compare to work on better visual tokenization (like from AI2 or others)?
Penguin-VL takes a different approach from work on visual tokenization (converting images into discrete tokens). While visual tokenizers try to compress image information more efficiently, Penguin-VL focuses on preserving more information in the encoder's continuous representations. These approaches are not mutually exclusive — it's plausible a future model could combine an LLM-initialized encoder with improved visual tokenization for further gains. See also AI2's OLMo data efficiency work for related thinking on architectural efficiency.
Penguin-VL's core contribution is an architectural insight that the field will need to engage with seriously: the standard CLIP vision encoder that every VLM inherits may be the wrong foundation for fine-grained visual understanding tasks.
The benchmark evidence is strong. Penguin-VL-2B leads on 14 of 17 tested benchmarks against the best models in its size class. Penguin-VL-8B consistently outperforms InternVL3.5-8B and GPT-5 nano on document understanding and video tasks. These aren't marginal improvements — on CharadesSTA (video temporal grounding), Penguin-VL-8B scores 61.4 vs InternVL3.5's 32.8.
There are caveats: abstract reasoning remains a weakness, training data provenance is not fully disclosed, and independent third-party evaluation hasn't happened yet. But the direction is clear enough that other major labs — including the ones that have relied on CLIP for years — will be running their own experiments now.
The Penguin-Encoder's separate release is the most strategically interesting element. If researchers find it improves results when dropped into other architectures, the case for an LLM-initialized vision encoder as the new default becomes substantially stronger. That would represent a meaningful architectural shift in a field that has taken CLIP's centrality largely for granted.
Watch this space. The CLIP era in VLMs may be ending.