TL;DR: A team of researchers from Shanghai Jiao Tong University and Xinhua Hospital published a paper in Nature on February 19, 2026 describing DeepRare — an agentic AI system that correctly identifies rare diseases on the first try 64.4% of the time versus 54.6% for experienced specialists with 10+ years of practice. It uses 40 specialized diagnostic tools, reasons transparently, and is already live at 600+ medical institutions. This is one of the most significant medical AI benchmarks ever recorded against real physicians.
Table of Contents
- The Rare Disease Problem Nobody Talks About
- What DeepRare Actually Is
- The Architecture: 40 Tools and a Reasoning Loop
- The Benchmark Numbers That Matter
- Why Beating Doctors Is Harder Than It Sounds
- What Makes DeepRare Different From Prior Medical AI
- Already Deployed — Not Just a Lab Project
- Limitations and What Researchers Acknowledge
- The Competitive Landscape: Medical AI Is Moving Fast
- What This Means For the Future of Diagnosis
The Rare Disease Problem Nobody Talks About
There are approximately 7,000 known rare diseases. They affect roughly 300 million people worldwide, a number approaching the population of the United States. Despite that scale, the word "rare" in the name creates a cruel paradox: each individual disease is so uncommon that most physicians encounter specific conditions only once or twice in an entire career, if at all.
The practical result is what patients and advocates call the "diagnostic odyssey." According to a survey by the China Alliance for Rare Diseases, 42% of rare disease patients had been previously misdiagnosed at least once. The average diagnostic delay was 4.26 years. In those years, patients receive treatments for the wrong conditions, sometimes ones that make the underlying disease worse. Families exhaust savings chasing diagnoses. Children reach school age without a name for what is happening to their bodies.
Eighty percent of rare diseases have a genetic origin. That means the answer is often in the genome — but decoding which variant matters, and linking it to a specific disease, requires expertise that is concentrated in a handful of academic medical centers globally. A specialist in Kyoto or Boston who might recognize a particular phenotype in minutes is simply not available to a physician in rural Indonesia, sub-Saharan Africa, or even smaller cities in wealthy countries.
This is the problem DeepRare was built to solve. And the results published in Nature suggest it is solving it — not experimentally, but in direct comparison with the very specialists who currently hold the expertise.
What DeepRare Actually Is
DeepRare is an agentic AI diagnostic system developed jointly by Xinhua Hospital, the Shanghai Jiao Tong University School of Medicine, and the SJTU School of Artificial Intelligence. The paper — titled "An Agentic System for Rare Disease Diagnosis with Traceable Reasoning" — was published in Nature on February 19, 2026, authored by Weike Zhao, Chaoyi Wu, Yanjie Fan, and colleagues.
The key word is "agentic." DeepRare is not a static classification model that takes a list of symptoms and returns a ranked list of conditions based on pattern matching. It is an active reasoning system that behaves more like a doctor working through a differential diagnosis than a search engine looking up likely matches.
In practical terms, this means DeepRare does not just process a patient's clinical presentation once and emit an answer. It forms an initial hypothesis, tests that hypothesis against available evidence, searches current medical literature for supporting or contradicting data, queries genetic databases, reviews prior case studies of similar presentations, and revises its conclusions in real time — iterating through this cycle multiple times before producing a final ranked differential diagnosis with a complete evidence trail attached to each suggestion.
The research team describes this as "an iterative cycle of hypothesis, verification and self-reflection to evaluate diagnostic clues and correct logical gaps." That description will be familiar to anyone who has worked with chain-of-thought reasoning in large language models, but applied here to one of the highest-stakes domains imaginable: telling a patient what disease is destroying their health.
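That cycle can be sketched in a few lines of code. The sketch below is illustrative only: the tool names, the three-candidate verification depth, and the convergence rule are assumptions for the example, not DeepRare's actual interfaces.

```python
# Minimal sketch of the hypothesis / verification / self-reflection cycle
# described above. All names here are illustrative assumptions.

def diagnose(presentation, tools, max_rounds=5):
    """Return a ranked differential plus the evidence trail behind it."""
    hypotheses = tools["differential"](presentation)  # initial ranked guesses
    evidence = []
    for _ in range(max_rounds):
        # Verification: gather external evidence for the leading candidates.
        evidence = [
            {
                "hypothesis": hyp,
                "literature": tools["literature_search"](hyp, presentation),
                "genetics": tools["variant_lookup"](hyp, presentation),
            }
            for hyp in hypotheses[:3]
        ]
        # Self-reflection: re-rank in light of the evidence; stop on convergence.
        revised = tools["reflect"](hypotheses, evidence)
        if revised == hypotheses:
            break
        hypotheses = revised
    return hypotheses, evidence
```

The point of the structure is that the final answer is never a single model pass: each round either revises the ranking or terminates with the evidence that justified it.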
The Architecture: 40 Tools and a Reasoning Loop
DeepRare's power comes from two interlocking components: a large language model serving as the central reasoning engine, and a toolkit of 40 specialized diagnostic instruments that the model can invoke during the reasoning process.
The 40 tools give the system access to capabilities that no single human specialist can match simultaneously. They include:
- Medical literature databases — the ability to search and synthesize current research on rare disease presentations, diagnostic criteria, and case reports
- Genetic variant analyzers — tools that process raw genomic sequencing data and evaluate the clinical significance of specific mutations, insertions, or deletions
- Phenotype ontology systems — structured databases like HPO (Human Phenotype Ontology) that map clinical observations to standardized terminology, enabling cross-referencing across disease definitions
- Differential diagnosis calculators — statistical tools that weight symptom combinations against known disease prevalence and co-occurrence patterns
- Real-world case databases — repositories of previously diagnosed cases with outcomes, enabling the system to find precedents for unusual presentations
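The "differential diagnosis calculator" category above can be pictured as a prevalence-weighted scorer. Here is a toy naive-Bayes version; the scoring rule and every number in it are invented for the sketch, not the paper's actual statistics.

```python
import math

# Toy prevalence-weighted differential calculator, in the spirit of the
# "differential diagnosis calculators" listed above. The naive-Bayes rule
# (log prior + log phenotype likelihoods) is an illustrative assumption.

def rank_differential(observed, prevalence, phenotype_freq, smoothing=1e-4):
    """Rank diseases by log prevalence plus log likelihood of each finding."""
    scores = {}
    for disease, prior in prevalence.items():
        score = math.log(prior)
        for phenotype in observed:
            # Frequency of this phenotype among patients with the disease,
            # smoothed so an unlisted phenotype penalizes but never zeroes out.
            freq = phenotype_freq.get(disease, {}).get(phenotype, smoothing)
            score += math.log(freq)
        scores[disease] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Even this toy captures the core tension of rare disease diagnosis: a condition with a vanishingly small prior can still win the ranking when the observed phenotype combination is far more typical of it than of any common alternative.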
The central LLM does not use all 40 tools for every case. It selects which instruments to invoke based on what the current evidence requires — a behavior that mirrors how a human specialist decides which tests to order rather than running every available test on every patient.
Critically, this architecture produces what the researchers call "traceable reasoning." Every diagnostic suggestion comes with a complete explanation of which tools were used, which evidence was considered, which hypotheses were rejected and why, and which pieces of information were most influential in reaching the conclusion. This transparency is not incidental — it is the feature that makes the system clinically usable. A physician receiving a DeepRare recommendation can evaluate the reasoning chain, spot errors in the evidence, and decide whether to follow or override the suggestion with full information.
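One way to picture what a traceable suggestion carries is a record like the following. The field names and rendering are assumptions for illustration, not DeepRare's actual schema.

```python
from dataclasses import dataclass, field

# A hypothetical record shape for a "traceable reasoning" suggestion:
# which tools ran, what evidence was weighed, and what was rejected and
# why. Field names are illustrative, not DeepRare's actual schema.

@dataclass
class DiagnosticSuggestion:
    disease: str
    rank: int
    tools_invoked: list = field(default_factory=list)
    supporting_evidence: list = field(default_factory=list)
    rejected_alternatives: list = field(default_factory=list)  # (disease, reason)

    def audit_trail(self) -> str:
        """Render the chain a reviewing physician would read top to bottom."""
        lines = [f"#{self.rank} {self.disease}"]
        lines += [f"  tool: {t}" for t in self.tools_invoked]
        lines += [f"  evidence: {e}" for e in self.supporting_evidence]
        lines += [f"  rejected {d}: {why}" for d, why in self.rejected_alternatives]
        return "\n".join(lines)
```

The design point is that the rejected alternatives are first-class data: a physician auditing the suggestion can disagree with a rejection just as easily as with the top-ranked answer.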
The Benchmark Numbers That Matter
The evaluation methodology in the Nature paper was rigorous enough to be credible even to skeptics of AI-in-medicine claims. The researchers tested DeepRare against two types of cases:
Retrospective validation: 6,401 cases with known diagnoses. These were historical clinical records where the correct diagnosis had already been established. DeepRare processed each case as a new presentation and its suggestions were compared against the confirmed answer. This is the standard methodology for evaluating diagnostic tools and gives a statistically meaningful baseline.
Prospective head-to-head: 163 difficult cases with live physician comparison. This is where the headline numbers come from. Five experienced physicians — each with more than 10 years of clinical practice in relevant specialties — were presented with the same 163 cases as DeepRare. Neither the physicians nor DeepRare had access to the confirmed diagnoses during the exercise.
The results on the head-to-head cases:

| Metric | DeepRare | Specialists (10+ years) |
| --- | --- | --- |
| Recall@1 (correct diagnosis ranked first) | 64.4% | 54.6% |
| Recall@3 (correct diagnosis in top three) | 79% | 66% |
| Physician endorsement of reasoning chains | 95.4% | n/a |
The Recall@1 gap — nearly ten percentage points — is the number that gets attention. But the Recall@3 gap is arguably more clinically meaningful. In practice, a physician reviewing a differential diagnosis considers multiple possibilities. A system that puts the correct answer in its top three suggestions 79% of the time, compared to 66% for specialists, is providing a substantially better shortlist for the physician to work through.
The 95.4% physician endorsement figure adds a qualitative dimension to the quantitative results. Even when the AI and the physician might have reached different rankings, the specialists reviewing DeepRare's reasoning chains agreed that the logic was sound more than 95% of the time. This is not a system that reaches correct answers via nonsensical paths — its reasoning was recognizable and credible to the human experts who evaluated it.
On phenotype-only cases (no genetic data available), DeepRare achieved 57.18% Recall@1, 23.79 percentage points higher than the previous best international model on the same task. When genomic sequencing data was added, accuracy on complex cases reached 70.6%, compared with 53.2% for Exomiser, the widely used genetic analysis tool that currently serves as the standard of care for many rare disease workups.
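Recall@k, the metric behind all of these comparisons, is simple to state: the fraction of cases whose confirmed diagnosis appears among the system's top k suggestions. A minimal implementation, with toy case data in the usage below:

```python
# Recall@k: the fraction of cases whose confirmed diagnosis appears
# among the system's top-k ranked suggestions.

def recall_at_k(ranked_lists, confirmed, k):
    """ranked_lists[i] is the ordered differential for case i;
    confirmed[i] is that case's established diagnosis."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, confirmed))
    return hits / len(confirmed)
```

Under this definition, Recall@3 can only be at least as high as Recall@1 for the same system, which is why the paper's Recall@3 numbers sit above the headline Recall@1 figures for both DeepRare and the specialists.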
Why Beating Doctors Is Harder Than It Sounds
Benchmarks against physicians are common in AI research papers, and they are commonly criticized. The criticism is usually valid: many such comparisons select cases that happen to favor the AI's training distribution, use a small number of physicians, allow the AI to run against simplified versions of the patient presentation, or compare against generalists rather than specialists.
The DeepRare evaluation holds up to most of these critiques better than typical papers in the genre.
The 163-case head-to-head study specifically selected "difficult cases" — presentations that were expected to challenge specialists. These are not easy diagnoses that any trained physician would get right. They were chosen precisely because rare disease diagnosis is hard, even for experts.
The five physicians in the comparative study had 10+ years of relevant experience each. These are not junior residents or general practitioners. They are the kind of specialists whose judgment currently defines the ceiling of what accurate rare disease diagnosis looks like in clinical practice.
The AI was given the same information as the physicians — clinical phenotype data, and in some configurations, genetic sequencing results. There was no information advantage for the AI beyond its access to external knowledge tools, which is precisely the kind of capability difference (access to current literature, large case databases) that a consulting specialist network would also provide.
Finally, the 95.4% physician endorsement rate suggests the results are not a statistical artifact of the AI guessing correctly via a route that physicians would consider medically unreasonable. The logic checked out when experts reviewed it.
What Makes DeepRare Different From Prior Medical AI
Medical AI has been generating headlines for roughly a decade. Radiology AI systems read mammograms with superhuman sensitivity. Dermatology classifiers identify melanoma from photographs. Retinal imaging tools detect diabetic retinopathy faster and cheaper than clinic-based screening. In each of these domains, the AI solved a well-defined pattern recognition problem on structured, high-quality images.
Rare disease diagnosis is different in almost every way that matters.
The inputs are messy and heterogeneous. A patient presenting with possible rare disease might have a mixture of: a physician's free-text notes describing symptoms in non-standardized language, lab values from multiple institutions using different reference ranges, genetic sequencing data from different sequencing platforms, imaging from different machines, HPO terms entered by a genetic counselor, and family history described by a non-medically-trained parent. There is no single clean data modality to train a classifier on.
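A toy illustration of that normalization problem: free-text symptom phrases from different sources mapped onto shared HPO-style term IDs so downstream tools can compare them. The synonym table below is invented for the sketch; a real system would query the full Human Phenotype Ontology rather than a hand-written dictionary.

```python
# Toy mapping from messy free-text phrases to HPO-style term IDs.
# The synonym table is invented for illustration; a real pipeline would
# resolve phrases against the full Human Phenotype Ontology.

SYNONYMS = {
    "fits": "HP:0001250",           # seizure
    "seizures": "HP:0001250",
    "floppy infant": "HP:0001252",  # muscular hypotonia
    "low muscle tone": "HP:0001252",
}

def normalize_phenotypes(free_text_phrases):
    """Map free-text phrases to a deduplicated, sorted list of term IDs."""
    terms = {SYNONYMS[p.strip().lower()]
             for p in free_text_phrases if p.strip().lower() in SYNONYMS}
    return sorted(terms)
```

Note that "fits" and "seizures" collapse to one term while an unrecognized phrase is silently dropped; deciding what to do with the unmapped remainder is itself part of the hard problem.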
The output space is enormous. There are 7,000 known rare diseases, many of which share overlapping symptom profiles. Some are so rare that case reports in the literature number in the dozens. No static model trained on historical data can keep pace with newly discovered conditions or recently updated diagnostic criteria.
The reasoning required is explicitly multi-step. Arriving at the right rare disease diagnosis often requires a detective's approach: eliminating more common conditions first, identifying which symptom combinations are genuinely discriminating rather than coincidental, weighing genetic findings against phenotypic expression, and reconsidering early hypotheses when new information contradicts them.
Prior computational tools for rare disease diagnosis — systems like Exomiser, Phenomizer, or AMELIE — addressed pieces of this problem. They were typically single-purpose: a genetic variant prioritizer that assumed the phenotype data was already clean, or a phenotype matcher that assumed genetic data was not available. DeepRare integrates across all these modalities within a single reasoning loop, which is why its performance gaps over prior tools are so substantial.
Already Deployed — Not Just a Lab Project
One of the most significant details in the coverage of the DeepRare paper is easy to miss: this system has been running in the real world since July 2025 — more than seven months before the Nature paper was published.
DeepRare has been available on an online diagnostic platform since that date. As of the paper's publication, over 1,000 professional users across 600+ medical and research institutions globally had registered to use it. That is not a small pilot at a single academic medical center. It is a functioning clinical tool that is actively being used by physicians to assist with rare disease cases at scale.
Researcher Sun Kun from the Shanghai Jiao Tong University team stated that the next phase includes "a global AI alliance for rare disease diagnosis and treatment" and validation of 20,000 real-world cases within six months. The validation plan is significant — it means the team is not treating the Nature paper as an endpoint but as a waypoint. Real-world performance across diverse patient populations will either confirm the benchmarked accuracy or identify gaps that need addressing.
The early deployment also means the system has been receiving feedback from the clinical community throughout its development period. The 95.4% physician endorsement rate in the structured study likely reflects iterative improvement based on that feedback — not a first-generation system thrown over the wall to practitioners.
Limitations and What Researchers Acknowledge
The DeepRare paper is unusually candid about what the system does not yet do.
Prospective validation at scale is incomplete. The 163-case head-to-head study is compelling but limited in size. The planned 20,000-case validation will be necessary to understand how performance varies across different rare disease categories, different patient demographics, and different input data quality levels. Rare diseases that were underrepresented in the training and evaluation datasets may perform worse than the headline numbers suggest.
Geographic and linguistic bias. The system was developed in China, using clinical cases and medical literature that reflect the patient population and healthcare documentation standards of Chinese medical institutions. Rare disease presentations can vary by population genetics, environmental factors, and clinical documentation conventions. Performance may differ in other settings until the system is validated and potentially fine-tuned on data from those populations.
It is a decision support tool, not a replacement. The researchers are explicit about this framing. DeepRare produces ranked hypotheses with reasoning chains for physician review. It does not — and should not — issue diagnoses autonomously. The physician remains the decision maker; DeepRare's role is to ensure that physician has access to a much broader knowledge base than any individual expert could carry.
Access inequality. A system that requires internet connectivity and integration with medical records infrastructure will not reach the patients most affected by diagnostic delays — those in regions with limited healthcare infrastructure. Solving the diagnostic problem technically is not the same as solving it clinically if the patients who most need help cannot access the tool.
The Competitive Landscape: Medical AI Is Moving Fast
DeepRare is not operating in a vacuum. The broader medical AI landscape has accelerated significantly in 2025 and 2026, and the rare disease diagnostic space has drawn particular attention from both academic institutions and commercial players.
Google DeepMind's Med-PaLM 2 demonstrated strong performance on medical licensing examination questions in 2024, but those benchmarks test knowledge recall rather than clinical reasoning in complex ambiguous cases. DeepRare's head-to-head against specialists represents a different and more demanding evaluation standard.
Microsoft's Dragon Copilot, launched at HIMSS 2026, targets the documentation burden in clinical encounters — automating the transcription and coding of physician notes. That is a different problem from DeepRare's diagnostic focus, but it represents the same underlying trend: AI moving from narrow pattern recognition into reasoning-based clinical support.
Lotus Health, which raised $35 million in March 2026 to offer a free AI doctor in 50 languages, is attempting to address healthcare access rather than diagnostic accuracy specifically. Its mission and DeepRare's are complementary — one focuses on reaching underserved patients, the other on solving the hardest diagnostic problems those patients and their physicians face.
AWS launched Amazon Connect Health in March 2026 with five AI agents targeting medical paperwork elimination. Again, this is the administrative layer rather than the diagnostic layer — but the infrastructure investment signals that the largest cloud providers are treating healthcare as a priority vertical.
What makes DeepRare distinctive in this landscape is the specificity of its focus and the credibility of its evaluation. A Nature paper with a physician head-to-head benchmark is a different claim than a product launch announcement. The peer review process for Nature requires methodological rigor that press releases do not. That does not make the product claims made by commercial competitors wrong — but it does mean DeepRare's performance numbers carry a different level of evidence behind them.
What This Means For the Future of Diagnosis
The significance of DeepRare extends well beyond rare diseases, though the rare disease application is compelling enough on its own.
For rare disease patients, the most immediate implication is a potential end to the diagnostic odyssey. If DeepRare can be integrated into the standard rare disease referral pathway — even as a second-opinion tool that physicians consult before concluding a workup — the average time from first symptom to correct diagnosis could shrink from years to weeks. For diseases where early intervention changes outcomes, this is not a quality-of-life improvement. It is a survival benefit.
For healthcare systems globally, a tool that accurately assists with rare disease diagnosis could dramatically reduce the cost of the diagnostic odyssey. Patients spend years being tested for the wrong conditions, seeing specialists who lack the relevant expertise, and receiving treatments that do not address the underlying disease. Each of those encounters costs money and time. A system that short-circuits this process has economic value that far exceeds its licensing cost.
For the broader AI research community, DeepRare demonstrates something important: agentic architectures — systems that coordinate multiple tools through a reasoning loop rather than producing outputs from a single model pass — can achieve performance levels in complex real-world tasks that static classifiers cannot. The 40-tool architecture is not elegant in the way a single end-to-end trained model would be. But it works better than the elegant solution in a domain where working better is what matters.
For medical AI credibility, the Nature publication and the 95.4% physician endorsement rate address one of the central objections to AI diagnostic tools: that they reach correct answers through reasoning that physicians cannot evaluate or trust. DeepRare's traceable reasoning chain allows the physician to see exactly how the conclusion was reached, reject or accept specific pieces of evidence, and integrate the AI's output into their own clinical judgment rather than accepting it as a black box.
For the field of agentic AI, DeepRare is a proof point that the paradigm of "LLM plus tools plus reasoning loop" generalizes beyond software engineering tasks (where systems like Claude Code and Cursor have demonstrated it) to domains with fundamentally different input modalities, output requirements, and stakes. A system that can reason iteratively through heterogeneous medical data is not the same as a system that can iterate on code — but the underlying architectural pattern is the same.
The natural next questions are about scope expansion. Rare disease diagnosis may be unusually well-suited to this approach because the problem space is large enough to benefit from broad knowledge access and specific enough to be rigorously evaluated against specialist benchmarks. But agentic diagnostic systems could, in principle, be applied to complex differential diagnosis in common diseases, treatment selection in oncology, pharmacogenomic drug-drug interaction analysis, and other clinical problems where the gap between what an individual physician knows and what is knowable from the full medical literature is large.
DeepRare is not the last word on medical AI — it is closer to the first sentence of the paragraph that finally matters.
The 300 million people living with rare diseases have been waiting for a long time. The diagnostic odyssey — measured in years of misdiagnosis, failed treatments, and mounting uncertainty — is one of medicine's quieter tragedies, not dramatic enough to dominate headlines but devastating for the individuals and families who experience it. A system that can reliably shorten that journey, trained on the full breadth of medical knowledge and reasoning transparently with evidence that physicians can verify, represents something genuinely new.
It will not reach every patient who needs it immediately. The deployment challenges — infrastructure, validation, integration, access equity — are real. But the performance ceiling has been raised, the validation methodology has been established, and the system is already live. For a field where "promising early results" too often means "years from clinical relevance," DeepRare's position at 600+ institutions while its Nature paper was still in review is an unusual and meaningful data point.
The paper is available at Nature (DOI: 10.1038/s41586-025-10097-9). The DeepRare platform is accessible at medical institutions globally, with registration available to qualified healthcare and research institutions.
Sources: Nature paper — CGTN coverage — Xinhua/SJTU announcement — The Next Web — Medical Xpress