Managing Technical Debt in Fast-Growing AI Startups
How AI startups accumulate—and systematically repay—technical debt: model versioning, prompt debt, data pipelines, eval debt, triage frameworks, and the 20% rule.
TL;DR: Technical debt in AI startups is not just messy code — it includes model versioning debt, prompt debt, data pipeline debt, and evaluation debt, all of which compound faster than traditional software debt. This post gives you a complete framework: how to classify AI-specific debt, how to measure it, when to repay vs. rewrite, the 20% rule for sustainable debt management, the technical debt register as a management tool, and how to communicate debt to non-technical stakeholders. Built from real experience building and advising AI companies that have felt this pain.
Every startup accumulates technical debt. It is the price of moving fast. The accepted wisdom is that early-stage companies should take on debt deliberately — ship quickly, learn from customers, fix the mess later. Most experienced engineering leaders know how to manage this cycle for traditional software.
AI startups are different, and most teams do not realize how different until they are drowning.
In a conventional SaaS product, technical debt accumulates in specific, bounded ways: messy code architecture, outdated dependencies, insufficient test coverage, poorly designed database schemas. These are painful to fix but they are also understandable. An engineer can look at a messy module, understand what it does, and refactor it.
In an AI product, the debt manifests in ways that are harder to see, harder to measure, and dramatically more dangerous to ignore:
Your model changes without your product changing. When an AI provider updates a model, your product behavior changes — potentially for every single user — without your engineering team touching a line of code. If you have not built model versioning into your architecture, you will not know this happened until customers complain.
Your prompts are a dependency you cannot pin. The prompt that drives a core feature of your product is fragile in ways that traditional code is not. A slight change in wording changes output behavior. A change in system prompt context affects every downstream response. And prompts are usually managed informally — in a spreadsheet, hardcoded in a file somewhere, or scattered across three different locations that are slightly different from each other.
Your data pipeline is both infrastructure and product. In AI products, the quality of your outputs is directly tied to the quality of your training data, fine-tuning data, or retrieval data. Debt in the data pipeline is invisible until it causes quality degradation — at which point, diagnosing it requires understanding the full pipeline, which is often not documented.
Your evaluation infrastructure is probably nonexistent. Most early-stage AI companies do not have systematic evaluation frameworks. They test new model versions or prompt changes manually, eyeballing a handful of examples. This works when the product is small and the team knows the expected behavior intuitively. It stops working at scale, and by the time it fails, you are in production with a degraded product and no systematic way to find the regressions.
The compounding effect is severe: a startup that accumulates all four types of debt simultaneously — model debt, prompt debt, data pipeline debt, and evaluation debt — will find itself unable to ship meaningful improvements, because every change to any part of the system has unpredictable ripple effects.
I have seen AI startups that were technically capable of building impressive demos completely stall at scale because of this compounding. The founders could not understand why their engineering team felt stuck. The engineers could not communicate why the architecture they had built was making every new feature harder than the last. The technical debt was invisible in the demo but fatal in production.
This post is the guide I wish had existed when I first encountered this problem.
Intentional debt is the debt you take on deliberately. You know the right way to build something and you choose a faster, messier approach because the cost of doing it properly now outweighs the benefit.
Classic examples in AI startups: hardcoding a prompt directly in application code to ship a feature this week, launching without model version pinning, and testing outputs by eyeballing a handful of examples instead of building an evaluation suite.
Intentional debt is the least dangerous type — because you know it is there and you can plan for it. The danger comes when intentional debt gets treated as permanent rather than temporary.
"Technical debt is like a credit card — a little debt can be useful to get started, but if you carry it indefinitely the interest payments will eventually consume everything." — Ward Cunningham, who coined the term
Accidental debt is the debt you do not know you are taking on. It accumulates from decisions that seemed reasonable at the time: quick integrations that outlive their original assumptions, special-case handling layered on by successive engineers, and context that leaves when the people who held it do.
In AI products, accidental debt often emerges from the model integration layer. In the early days, someone writes a wrapper around the LLM API that works fine for the initial use case. Over 18 months, twelve engineers touch it, adding parameters, conditional branches, and special-case handling. The wrapper is now a 600-line file that no one fully understands, that handles edge cases for every feature that has ever used it, and that breaks in unpredictable ways whenever anyone touches it.
Bit rot is debt that accumulates passively — the codebase degrades simply because the world around it changes while the code does not.
In traditional software, bit rot affects dependencies that become outdated, APIs that get deprecated, and libraries that lose support. In AI software, bit rot is dramatically accelerated: model APIs and provider SDKs change on quarterly rather than yearly timescales, prompt techniques that were best practice a year ago no longer are, and the models themselves get deprecated on short notice.
A startup that builds aggressively in AI and does not maintain its dependencies will find 12-18 months of bit rot creating a maintenance crisis on a compressed timeline.
Model drift is unique to AI systems and has no close analogue in traditional software engineering. It is the accumulation of debt caused by the gap between how the model behaves today and how your system was designed to expect it to behave.
Model drift has two forms:
Upstream drift: The model provider changes the model. This can be a scheduled version update, a training update that shifts behavior without a version change, or a change in safety filtering that causes more outputs to be refused.
Distribution drift: Your users change. The queries users are sending your system today are different from the queries your prompts and retrieval system were designed to handle six months ago. Your product has grown, new use cases have emerged, and your AI system has not been updated to reflect the actual distribution of real-world usage.
Both types of drift are silent. The system appears to be working. Outputs are still generated. But the quality has degraded, edge cases are more frequent, and the user experience has quietly gotten worse. This is the most dangerous type of technical debt in AI because it is the hardest to detect and the hardest to diagnose.
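One way to make silent drift visible is to track cheap, objective properties of sampled outputs over time and alert when they move. A minimal sketch; the refusal heuristics, the two metrics, and the 10% tolerance are illustrative assumptions, not recommendations from this post:

```python
# Sketch: surface silent drift by tracking cheap output properties
# over time. Metrics and thresholds here are illustrative.

def output_stats(outputs):
    """Compute refusal rate and mean length for a sample of outputs."""
    refusals = sum(
        1 for o in outputs
        if o.strip().lower().startswith("i can't") or "cannot help" in o.lower()
    )
    return {
        "refusal_rate": refusals / len(outputs),
        "mean_length": sum(len(o) for o in outputs) / len(outputs),
    }

def drifted(baseline, current, tolerance=0.10):
    """Flag drift when any metric moves more than `tolerance` (relative)."""
    for key, base in baseline.items():
        if base == 0:
            if current[key] > tolerance:
                return True
        elif abs(current[key] - base) / base > tolerance:
            return True
    return False
```

Run `output_stats` on a weekly production sample, store the result, and compare each week's numbers against a frozen baseline with `drifted`.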
| Type | Source | Visibility | Speed of accumulation | AI-specific? |
|---|---|---|---|---|
| Intentional | Deliberate trade-off | High | Controlled | No |
| Accidental | Unknown unknowns | Low | Medium | No |
| Bit rot | World changes around static code | Medium | Medium-high | Accelerated in AI |
| Model drift | Model or user behavior changes | Very low | High | Yes — unique to AI |
You cannot manage what you cannot measure. Here are the metrics I use to quantify technical debt and its impact.
These measure the cost of technical debt in terms of engineering productivity.
| Metric | What it measures | How to collect | Healthy benchmark |
|---|---|---|---|
| Lead time for changes | Time from commit to production | CI/CD pipeline logs | Under 1 day for small changes |
| Deployment frequency | How often you deploy to production | Deployment logs | Multiple times per week |
| Change failure rate | % of deployments causing incidents | Incident tracking | Under 15% |
| Mean time to recover (MTTR) | How long to restore service after incident | Incident tracking | Under 1 hour for P1 |
| Feature cycle time | Time from feature kickoff to shipped | Project management tool | Trending flat or down over time |
The most diagnostic signal is feature cycle time trend. If shipping a small feature took 3 days six months ago and takes 8 days today, the debt accumulated in the past six months is measurable in engineering days.
These metrics are specific to AI products and measure the health of the model, prompt, and data pipeline layer.
Evaluation coverage: What percentage of your core user journeys have automated evaluations? If the answer is less than 60%, you have evaluation debt that will prevent you from confidently shipping model or prompt changes.
Prompt regression rate: When you make a change to a prompt or model version, what percentage of previously working test cases break? A healthy system should have this below 5%. If you do not know the answer, you do not have evaluation infrastructure.
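If you have a golden test set, the regression rate is straightforward to compute. A minimal sketch with stubbed-out checks in place of real model calls; all names here are hypothetical:

```python
# Sketch: prompt regression rate against a golden test set.
# `run_case` would call your model; here it is stubbed.

def regression_rate(golden_cases, run_case):
    """golden_cases: list of (input, check) pairs, where check is a
    predicate on the output. run_case: callable producing the output.
    Returns the fraction of previously passing cases that now fail."""
    failures = 0
    for prompt_input, check in golden_cases:
        if not check(run_case(prompt_input)):
            failures += 1
    return failures / len(golden_cases) if golden_cases else 0.0

golden = [
    ("summarize: hello world", lambda out: len(out) > 0),
    ("extract date: Jan 5",    lambda out: "Jan" in out),
]
# A stub that uppercases breaks the case-sensitive "Jan" check:
rate = regression_rate(golden, run_case=lambda p: p.upper())
```

Gate prompt and model changes on this number staying under your threshold (the post suggests 5%).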
Data pipeline freshness: How stale is the data your retrieval system uses? If you have a RAG system, when was the data last refreshed? If you have fine-tuned a model, when was the training data last updated? Staleness beyond a certain threshold is model drift debt.
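A freshness check can be a few lines. A sketch assuming a 30-day threshold; the threshold is an illustrative choice to tune per product, not a recommendation from the post:

```python
# Sketch: flag stale retrieval data as a model-drift debt signal.
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(days=30)  # illustrative; tune per product

def is_stale(last_refreshed, now=None):
    """Return True when the retrieval corpus has not been refreshed
    within the freshness threshold."""
    now = now or datetime.now(timezone.utc)
    return now - last_refreshed > FRESHNESS_THRESHOLD
```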
Model version lag: Are you running on the current version of the models you depend on? Being more than one major version behind on a core model dependency is a debt item that will require emergency work when the older version is deprecated.
The single metric that most reliably tracks accumulated technical debt is deployment frequency. When a codebase is clean and well-tested, engineers deploy often because changes are safe and easy to make. When a codebase is heavily indebted, engineers become cautious because changes have unpredictable side effects. Deployment frequency drops.
Plot your deployment frequency on a monthly basis for the past year. If it is declining or flat while engineering headcount is growing, technical debt is the most likely explanation.
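Counting deployments per month from your deployment log timestamps is enough to see the trend. A sketch; the input format is an assumption:

```python
# Sketch: monthly deployment counts from a list of deploy timestamps.
from collections import Counter
from datetime import datetime

def monthly_deploy_counts(deploy_times):
    """deploy_times: iterable of datetime objects, one per deployment.
    Returns {(year, month): count}, sorted chronologically."""
    counts = Counter((t.year, t.month) for t in deploy_times)
    return dict(sorted(counts.items()))

deploys = [datetime(2026, 1, 5), datetime(2026, 1, 20), datetime(2026, 2, 3)]
```

Plot the resulting series against engineering headcount for the same months; a flat or declining line while headcount grows is the debt signal described above.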
Once per quarter, run a short anonymous survey with your engineering team asking where work feels slower, riskier, or more frustrating than it should be: which parts of the codebase they avoid touching, which changes they fear deploying, and which workarounds they have stopped questioning.
This survey surfaces the debt that is not visible in the metrics — the friction, the fear, the workarounds that engineers have built up and normalized without realizing how much they cost.
The 20% rule is a heuristic for sustainable technical debt management: allocate 20% of every engineering sprint to debt repayment and infrastructure improvement.
This number is not arbitrary. Engineering teams that allocate less than 10% to debt work see their velocity decline over time — debt compounds faster than it is being repaid. Teams that allocate more than 30% often fall behind on product delivery in ways that create business risk. 20% is the zone that allows a growing product to maintain velocity while keeping debt under control.
For AI startups, I recommend 25% rather than 20%, for two reasons:
First, AI-specific debt (evaluation infrastructure, model versioning, prompt management) requires proactive investment that does not have an equivalent in traditional software. These are not "cleaning up" activities — they are foundational infrastructure that needs to be built before you need it urgently.
Second, the cost of AI debt compounding is higher than traditional software debt. When a traditional software system degrades, users encounter bugs. When an AI system degrades, users lose trust in the product quality in ways that are harder to recover from.
Make it structural, not discretionary. If the 20% debt budget is subject to negotiation every sprint, it will be zero most sprints. Product pressure is always real and always urgent. Debt repayment is never urgent until it is catastrophic. The only way to protect the budget is to make it a standing allocation.
Give the debt budget to the engineers, not the managers. The people closest to the code know where the debt is most painful. Let them nominate and prioritize debt work within the 20% budget. This increases the quality of what gets addressed and increases engineer morale — they feel trusted to improve the system they work in.
Track it separately. Debt work should be tracked in your project management system with a distinct label. This lets you measure whether you are actually spending 20% and see what impact it is having on velocity metrics over time.
Set quarterly debt goals. Within the 20% budget, set a quarterly goal: "This quarter we will complete the model versioning abstraction layer." This gives debt work the same planning discipline as product work.
There are situations where 20% is insufficient: a Q1 item is already causing customer-visible quality problems, a core model or dependency is being deprecated on a deadline, or velocity has declined to the point where normal product delivery has stalled.
In these situations, declare a debt sprint: a full sprint where the entire engineering team focuses on debt reduction, with no new features. The business pain of a debt sprint is real but bounded. The business pain of a debt crisis is unbounded.
Every significant debt repayment item should have a before-and-after measurement. Before you fix the integration layer, measure: average time to ship a feature that touches it, number of incidents per week caused by it. After the fix, measure the same things. This is how you demonstrate to the business that debt repayment pays off — not as an abstract argument, but as a specific before/after comparison.
Not all technical debt deserves equal attention. The triage framework I use evaluates debt on two dimensions: business impact if unaddressed, and cost to fix.
| | Low cost to fix | High cost to fix |
|---|---|---|
| High business impact | Q1: Urgent (fix immediately) | Q2: Strategic (plan and schedule) |
| Low business impact | Q3: Quick win (fix when convenient) | Q4: Defer (accept and monitor) |
Q1: Urgent (High Impact, Low Cost) — Drop everything and fix. These are the items where the risk is real, the cost to fix is contained, and the benefit is immediate. Example: a hardcoded API key that is not rotated, an evaluation gap on a core user journey, a prompt that is producing incorrect outputs in an edge case that is becoming more common.
Q2: Strategic (High Impact, High Cost) — Plan carefully and schedule properly. These are architectural changes that require significant engineering effort but are essential for the product to scale. Example: replacing a monolithic LLM integration with a model abstraction layer, rebuilding the data pipeline from a batch to a streaming architecture, implementing proper evaluation infrastructure from scratch.
Q3: Quick Win (Low Impact, Low Cost) — Address opportunistically. These are items that do not cause real business harm but improve code quality and engineer experience. Address them during sprint buffer time or as part of the 20% debt budget.
Q4: Defer (Low Impact, High Cost) — Accept and monitor. These are items where the cost of fixing is disproportionate to the benefit. Document them, monitor whether they move to a different quadrant as the business evolves, and do not feel bad about leaving them.
For each debt item you identify, answer two questions: what is the business impact if it stays unaddressed for the next six months (high or low)? And what is the cost to fix (high or low, in engineer-weeks)?
Place each item in the appropriate quadrant. Your Q1 items become your immediate priority list. Your Q2 items become your quarterly debt goals. Your Q3 items go into the 20% budget backlog. Your Q4 items go into the debt register with a monitoring note.
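The quadrant assignment is mechanical enough to encode directly, for example in a register-processing script. A sketch mirroring the framework's labels:

```python
# Sketch: map the two triage answers onto a quadrant and action,
# matching the 2x2 framework described above.

def triage(business_impact, fix_cost):
    """business_impact and fix_cost are each 'high' or 'low'."""
    high_impact = business_impact == "high"
    low_cost = fix_cost == "low"
    if high_impact and low_cost:
        return ("Q1", "Fix immediately")
    if high_impact:
        return ("Q2", "Plan and schedule")
    if low_cost:
        return ("Q3", "Fix when convenient")
    return ("Q4", "Accept and monitor")
```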
| Debt Item | Business Impact | Fix Cost | Quadrant | Action |
|---|---|---|---|---|
| No model version pinning in production | High — silent regressions on every provider update | Low — 1-2 days | Q1 | Fix immediately |
| No evaluation suite for core outputs | High — cannot safely ship model or prompt changes | High — 4-6 weeks | Q2 | Plan for next quarter |
| Prompts hardcoded across 6 files | Medium — inconsistency and maintenance burden | Medium — 1 week | Q3 | Address in 20% budget |
| Old embedding model (no issues yet) | Low now, high in 12 months when deprecated | Medium — 1 week migration | Q3/Q2 | Schedule before deprecation |
| Unused experiment code from 18 months ago | Low | Low | Q3 | Delete when touching related file |
One of the hardest technical decisions at any fast-growing company is whether to refactor an existing system or rewrite it from scratch. Joel Spolsky famously argued that rewrites are almost always wrong. The practical reality is more nuanced.
The conventional wisdom is strong: rewrites are expensive, take longer than expected, and often reproduce the same problems in a different form. The bugs in the original system are not all in the code — some of them are in the team's understanding of edge cases and user behavior that accumulates over time. A rewrite loses that implicit knowledge.
Additionally, the opportunity cost of a rewrite is real: the features you are not shipping, the technical improvements you are not making to the existing system. A six-month rewrite is six months of technical investment with no customer-facing output.
AI products have specific characteristics that make rewrites more justifiable than in traditional software:
Architectural debt in AI compounds faster. When your LLM integration layer was designed around a specific model and usage pattern, and your product has evolved to need five different models, twelve different prompt types, and a retrieval system, the original architecture may be so far from what you need that incremental refactoring would take longer than a clean rewrite.
AI infrastructure has high integration coupling. Traditional software debt often lives in isolated components. AI debt tends to be deeply integrated — your prompt management, your model versioning, your evaluation system, and your data pipeline all affect each other. Fixing any one of them properly often requires rethinking the others.
Model changes will force architectural changes anyway. If you are planning to migrate from one model provider to another, or to move from pure LLM generation to a hybrid retrieval-augmented approach, you may need to rethink major portions of the system regardless. At that point, a principled rewrite may be less work than incremental migration.
Use this framework before committing to a rewrite:
| Question | Rewrite signal | Refactor signal |
|---|---|---|
| Is the problem in the architecture or the implementation? | Architecture | Implementation |
| Can the system be incrementally improved in 6 sprints? | No | Yes |
| Do we fully understand what the new system needs to do? | Yes | No (unclear requirements rule out a rewrite) |
| Do we have 30%+ engineering capacity to spare for 8+ weeks? | Yes | No |
| Have we documented what the current system does, including edge cases? | Yes — knowledge will transfer | No — do this before anything else |
A rewrite is justified when: the architecture is the problem, incremental improvement is demonstrably slower than a clean start, the team understands the requirements fully, and you have the capacity to execute without starving the product roadmap.
A rewrite is not justified when: you cannot fully specify what the new system needs to do, you are hoping the rewrite will solve organizational problems, or you are making the decision out of frustration rather than analysis.
The best approach for most AI startups is neither a full rewrite nor pure refactoring — it is an incremental rewrite using the Strangler Fig pattern.
The Strangler Fig pattern (named for the plant that grows around a tree and gradually replaces it) involves building the new system alongside the old one, routing new functionality through the new system first, migrating existing functionality to it incrementally, and retiring the old system only once nothing depends on it.
This approach allows you to maintain product delivery while making architectural progress, test the new system in production with limited exposure before full commitment, and preserve the institutional knowledge embedded in the current system by migrating it incrementally rather than replacing it at once.
For AI startups, a common Strangler Fig migration is moving from a monolithic LLM call to a proper orchestration layer: you build the new layer for new features first, validate it works, then incrementally migrate the old features one by one until the old layer can be safely retired.
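A minimal sketch of that routing step, with a hypothetical feature registry deciding which paths have been migrated to the new orchestration layer:

```python
# Sketch: Strangler Fig routing between a legacy LLM call path and a
# new orchestration layer. Feature names are illustrative.

MIGRATED_FEATURES = {"summarize", "classify"}  # grows as migration proceeds

def handle_request(feature, payload, legacy_layer, new_layer):
    """Route migrated features to the new layer; everything else stays
    on the legacy path until it is migrated and validated."""
    if feature in MIGRATED_FEATURES:
        return new_layer(feature, payload)
    return legacy_layer(feature, payload)
```

The design benefit is that each migration is a one-line change to the registry, individually testable and individually reversible.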
The technical debt register is a living document that tracks all known debt items. It is the single source of truth for what debt exists, what it costs, and what the plan is.
Most companies track bugs in a bug tracker and features in a product management tool. Technical debt often lives nowhere — it exists in engineers' heads, in Slack messages, in GitHub issue comments. When debt is invisible, it cannot be managed.
| Field | Description |
|---|---|
| ID | Unique identifier (TD-001, TD-002, etc.) |
| Title | Short description of the debt item |
| Type | Intentional / Accidental / Bit rot / Model drift |
| Component | Which part of the system is affected |
| Business impact | High / Medium / Low |
| Fix cost | High / Medium / Low (in engineer-weeks) |
| Quadrant | Q1 / Q2 / Q3 / Q4 |
| Date identified | When this was first logged |
| Owner | Who is responsible for resolving this |
| Resolution plan | Refactor / Rewrite / Monitor / Accept |
| Target quarter | When this is planned to be addressed |
| Status | Open / In progress / Resolved |
| Notes | Additional context, links to related issues |
For AI startups, add these additional fields:
| Field | Description |
|---|---|
| AI debt type | Model versioning / Prompt / Data pipeline / Evaluation |
| Model dependency | Which model or provider is affected |
| Evaluation coverage | Is there automated testing for this component? Y/N |
| Drift risk | Is this component at risk of silent degradation? H/M/L |
| Last reviewed | When was this debt item last assessed? |
The debt register is only useful if it is maintained. Here is the minimum viable maintenance process:
Sprint planning: Review Q1 items. Confirm at least one is being addressed this sprint.
Monthly: Scan all Open items for any that have moved from Q3 or Q4 to Q1 or Q2 (business impact has increased). Update status of in-progress items.
Quarterly: Full review and re-scoring of all items. Archive resolved items. Add new items identified during the quarter. Set target quarters for Q2 items.
Annually: Assess whether any Q4 items should be promoted to a higher priority or permanently accepted as-is.
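The monthly scan lends itself to a small script over the register. A sketch, assuming register entries are dicts with fields from the schema above; the 90-day staleness cutoff is an illustrative choice:

```python
# Sketch: flag open register items that are overdue for re-scoring.
# Field names mirror the register schema but are illustrative.
from datetime import date, timedelta

def stale_items(register, today, max_age_days=90):
    """register: list of dicts with 'status' and 'last_reviewed' (date).
    Returns items that are open but have not been reviewed recently."""
    cutoff = today - timedelta(days=max_age_days)
    return [item for item in register
            if item["status"] == "Open" and item["last_reviewed"] < cutoff]
```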
ID: TD-014
Title: No model version pinning — production uses latest API default
Type: Intentional (created as shortcut during initial build)
Component: LLM integration layer
Business impact: High — silent behavioral regressions when provider updates model
Fix cost: Low — 1-2 engineer days to implement version pinning + staging test suite
Quadrant: Q1
Date identified: 2026-02-12
Owner: [Lead engineer name]
Resolution plan: Refactor — pin versions in config, add model version to all API calls
Target quarter: Q1 2026 (immediate)
Status: In progress
AI debt type: Model versioning
Model dependency: Primary LLM provider, embedding model
Evaluation coverage: N — no automated regression tests exist yet
Drift risk: High — any model update could silently change output behavior
Last reviewed: 2026-03-01
Notes: Linked to TD-018 (evaluation infrastructure gap). Should be resolved together.
The debt register should be accessible to the whole engineering team, not just engineering leads. Engineers who are working in a system and identify debt should be able to log it immediately. Engineers who are planning work should consult it to understand what constraints they are working within.
The register should also be visible in summary form to the product and leadership team. This is how you make the debt conversation proactive rather than reactive.
This is where most engineering leaders struggle. Technical debt is real, it has real costs, but explaining it to a CEO, board member, or investor without technical background is genuinely hard.
The fundamental problem: technical debt is invisible until it causes pain, and by the time it causes pain, the conversation is happening in crisis mode rather than planning mode.
The solution is to translate debt into business language before the crisis.
"Our current technical debt is costing us approximately X engineer-days per sprint in overhead — debugging unpredictable behavior, working around architectural limitations, managing the operational burden of the current infrastructure. If we invested Y sprints in debt reduction, we estimate we would recover Z engineer-days per sprint within N quarters. That is equivalent to hiring additional engineers without the headcount cost."
This framing works because it converts an abstract engineering concern into a staffing and productivity question that any executive understands.
"Our current evaluation coverage for AI outputs is approximately X%. This means that when we make changes to our model, prompts, or data pipeline, we have limited visibility into whether those changes are degrading quality for Y% of user journeys. A quality regression in those uncovered journeys could affect customer retention before we catch it. Investing Z in evaluation infrastructure this quarter reduces that blind spot from A% to B%."
This framing converts evaluation debt into a risk management conversation. It answers the question "what could go wrong?" — language that boards and investors speak.
"The architectural limitations in our current system mean it takes X days to ship a new feature of this type. If we make this investment now, similar features would take Y days. Over the next Z features on our roadmap, that is a difference of N engineer-weeks of capacity — capacity that would otherwise go toward shipping the features you have asked for."
This framing connects debt repayment directly to the product roadmap. It answers "what do I get?" rather than asking for engineering hygiene investment with no visible business payoff.
I recommend a brief (10 minute) debt update to the leadership team each quarter: the top open items from the debt register, what was repaid last quarter and its measured impact, what is planned for next quarter, and the trend in the velocity metrics.
This keeps the conversation proactive and positions engineering as a responsible steward of technical infrastructure rather than a team that springs debt crises on the business without warning.
How you structure your engineering team affects how technical debt accumulates and gets managed. Here are the structural patterns I recommend for AI startups at different stages.
At this stage, there is no dedicated infrastructure or platform team. Everyone is building product. Technical debt management needs to be embedded in the engineering culture rather than assigned to a specific role.
Recommended practices at this stage: apply the debt budget from day one, log shortcuts in a shared debt register as they are taken, and treat every intentional shortcut as a documented decision with a planned repayment date.
AI-specific: Designate one engineer as "model health owner" — responsible for monitoring model version updates, maintaining the evaluation suite, and flagging drift signals. This is not a full-time role at this stage, but having it explicitly owned prevents it from being everyone's responsibility and therefore no one's.
The minimum viable AI infrastructure checklist for this stage:
| Item | Why it matters | Effort |
|---|---|---|
| Model version pinned in config | Prevents silent regressions | 1 day |
| Prompts stored in version control | Enables rollback and diff | 1 day |
| 20 golden test cases for core outputs | Basic regression coverage | 2-3 days |
| Weekly model health check (manual) | Catches drift before customers do | 30 min/week |
| Data pipeline freshness monitoring | Alerts on stale retrieval data | 2 days |
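The first checklist item can be as small as a dict in a version-controlled config file plus a loader that refuses to fall back to a floating default. A sketch; the model names and keys are hypothetical:

```python
# Sketch: pinned model versions in version-controlled config, with a
# loader that fails loudly instead of using a provider's "latest".
# Model identifiers here are made up for illustration.

MODEL_CONFIG = {
    "chat_model": "provider-model-2026-01-15",  # exact pinned version
    "embedding_model": "provider-embed-v3",
}

def resolve_model(role):
    """Look up the pinned version for a role; refuse to guess."""
    try:
        return MODEL_CONFIG[role]
    except KeyError:
        raise RuntimeError(
            f"No pinned model for role '{role}'; refusing a floating default"
        )
```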
At this stage, you can afford to create a small platform or infrastructure sub-team (2-3 engineers) whose primary responsibility is the foundational systems that every product team depends on.
Responsibilities of the platform sub-team: the model integration and orchestration layer, evaluation infrastructure, prompt and model version management, and the data pipelines that every product team depends on.
The platform team does not work on features. They work on the systems that make feature development safe and fast. They are measured by the velocity of the product teams, the reliability of the AI stack, and the reduction in incident rate.
The recommended practices from the earlier stage still apply (the protected debt budget, the maintained register, quarterly debt goals), now coordinated through the platform sub-team rather than left to individual engineers.
At scale, the platform sub-team becomes a full Platform Engineering team. The debt management becomes a more formal program:
Structured debt sprints: One sprint per quarter is designated as an infrastructure sprint. All product teams participate with 50% of their capacity. The platform team coordinates.
Debt SLA: Each debt item in Q1 has a maximum time-to-resolution SLA based on severity, agreed with the affected teams and tracked in the debt register.
Engineering excellence lead: A senior engineer role focused on code quality, architectural review, and debt management practices across all teams. This person reviews all major PRs that touch shared infrastructure and owns the debt register.
AI-specific at scale: A dedicated ML platform team separate from the product platform team. Their scope covers model evaluation, fine-tuning infrastructure, prompt versioning, and the data pipelines that feed all AI features. The distinction matters because ML infrastructure has different reliability requirements and different expertise requirements than general application infrastructure.
Q: How do we handle technical debt when we are racing a competitor?
The standard advice of "just ship fast and fix later" does not work for AI products because AI debt is silent and compounds rapidly. The practical answer: take on intentional, documented debt — not accidental debt. For each shortcut you take, log it in the debt register with an explicit plan for when it will be addressed. This gives you the speed you need while ensuring the debt does not become invisible. The worst outcome is taking on debt you do not know about, because you cannot plan to repay it.
Q: Our engineering team keeps saying we have too much technical debt but we cannot see the business impact. What should we ask?
Ask these specific questions: "What feature took longer to ship this quarter than you expected because of technical debt, and how much longer?" and "What outage or incident this quarter was caused or worsened by technical debt?" If the engineers cannot answer these questions specifically, the debt may not be as severe as feared. If they can answer them with specifics, you now have the business impact you need to justify addressing it.
Q: When should we pin our model versions vs. always using the latest?
Always pin your model versions in production. This prevents silent degradation from model updates. Use a three-environment model: production (pinned, stable), staging (candidate testing of new model versions with your evaluation suite), and development (latest available for exploration). When you want to upgrade, test in staging against all your golden test cases, confirm no regressions, then promote to production on a known schedule. The upgrade cadence depends on how fast the provider moves and how much your product depends on the latest capabilities — but the decision should always be deliberate, not automatic.
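A sketch of the three-environment pinning and a promotion step gated on the evaluation suite; the environment names and model identifiers are hypothetical:

```python
# Sketch: per-environment model pins with a gated promotion step.
# Identifiers are illustrative.

ENV_PINS = {
    "production":  "model-2026-01-15",  # pinned, stable
    "staging":     "model-2026-03-01",  # candidate under evaluation
    "development": "latest",            # exploration only
}

def promote_to_production(eval_passed):
    """Promote the staging candidate only on a green evaluation run."""
    if not eval_passed:
        raise RuntimeError("Evaluation suite failed; keeping current production pin")
    ENV_PINS["production"] = ENV_PINS["staging"]
    return ENV_PINS["production"]
```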
Q: How do we build an evaluation suite when we do not know what "correct" looks like for generative AI outputs?
Start with the cases where you do know. For most AI products, there are "golden examples" — inputs for which you know exactly what a good output looks like. These are your first evaluation cases. Expand from there by: sampling production outputs and having team members rate them (creating labeled data over time), using LLM-as-judge for outputs where human labeling is too slow to be practical, and building behavioral evaluations that measure specific properties (response latency, output length, refusal rate, format compliance) that are easier to verify than subjective content quality. Even 20 golden test cases is far better than zero. Start there.
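Behavioral evaluations in particular are easy to start with because they check objective properties. A sketch of two such checks; the JSON format requirement and the length bounds are illustrative assumptions about a hypothetical product:

```python
# Sketch: behavioral evaluations that verify objective properties of
# an output, even when "correct content" is subjective.
import json

def format_complies(output):
    """Illustrative check: output must be valid JSON with a 'summary' key."""
    try:
        return "summary" in json.loads(output)
    except (ValueError, TypeError):
        return False

def length_in_bounds(output, lo=20, hi=2000):
    """Illustrative check: output length within product bounds."""
    return lo <= len(output) <= hi

def run_behavioral_suite(output):
    return {
        "format": format_complies(output),
        "length": length_in_bounds(output),
    }
```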
Q: How do we handle prompt versioning without a dedicated prompt management system?
At minimum, store prompts in version control — not hardcoded in application code, but in a separate file or directory tracked by git. This gives you prompt history, rollback capability, and diff tooling for prompt changes. Label each prompt with a version number and maintain a changelog. This is not as good as a dedicated prompt management system (which provides production/staging separation, A/B testing, and analytics), but it is dramatically better than no versioning. Build the proper system when you have the engineering capacity — typically around the 8-15 engineer stage.
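A sketch of the minimal version: a plain-text prompt file with a version header, parsed by a few lines of code. The header convention is a hypothetical one, not a standard:

```python
# Sketch: prompts as plain files in git, each with a version header.
# Illustrative layout:
#   prompts/
#     summarize.txt   <- first line "# version: 3", body is the prompt
#     CHANGELOG.md

def parse_prompt(text):
    """Split a version header from the prompt body."""
    header, _, body = text.partition("\n")
    if not header.startswith("# version:"):
        raise ValueError("Prompt file missing version header")
    return int(header.split(":")[1].strip()), body.strip()

version, body = parse_prompt("# version: 3\nSummarize the text below.")
```

With this in place, `git log` and `git diff` on the prompts directory give you the history, rollback, and diff tooling the answer describes.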
Q: What is the biggest technical debt mistake AI startups make?
Not building evaluation infrastructure early enough. Every other type of debt is painful and expensive but ultimately visible. The lack of evaluation infrastructure is uniquely dangerous because it makes every other problem invisible. Without evaluations, you cannot confidently upgrade model versions, you cannot safely change prompts, you cannot know whether a new feature has regressed existing functionality, and you cannot measure whether your AI system is getting better or worse over time. Build your evaluation suite before you think you need it. By the time you think you need it, you already needed it six months ago.
Q: How do we communicate urgency about technical debt without causing panic?
Use the velocity tax framing described in the stakeholder communication section, and present it with data. For example: "Our deployment frequency has declined 40% over the past two quarters, and our post-incident analysis shows that 60% of incidents involved components we identified as high-debt in our register. We are proposing one debt sprint next quarter to address the three Q1 items, which we estimate will recover this velocity within two quarters." Specific, data-driven, action-oriented. That is the format that converts a vague concern into a funded initiative.
Q: We inherited a codebase with significant AI debt from a previous team. Where do we start?
Do the debt audit first before making any changes. Spend two weeks documenting what exists: what models are used, where prompts live, whether there are any evaluations, what the data pipeline looks like, what the deployment process is. Build the debt register from this audit. Then classify every item into the quadrant framework. Fix the Q1 items immediately — these are the ones with high impact and low cost, and addressing them builds credibility with the team and the business. Then plan the Q2 items as a structured program. Do not try to fix everything at once; prioritization is the work.
Udit Goenka is a founder, investor, and builder. He has built and scaled AI-powered products and writes about engineering strategy, product development, and the specific challenges of building companies at the intersection of software and AI.