TL;DR: Google released Gemini 3.1 Flash-Lite in preview on March 3, 2026, with a pricing floor of $0.25 per million input tokens — the lowest in Gemini's model lineup. The headline feature is Thinking Levels: a developer-controlled dial with four settings (Minimal, Low, Medium, High) that adjusts how much reasoning compute the model applies per request. The model runs 2.5x faster than Gemini 2.5 Flash at baseline. Google is explicitly targeting cost-sensitive production workloads where Claude Haiku and GPT-4o-mini currently dominate.
What you will learn
- What Gemini 3.1 Flash-Lite is and what it replaces
- Thinking Levels explained: four settings, one API parameter
- Pricing: $0.25/M input tokens in context
- Speed benchmark: 2.5x faster than Gemini 2.5 Flash
- Competitive pricing table: Haiku, GPT-4o-mini, Flash-Lite
- Why adjustable reasoning depth is a UX innovation
- Developer experience: API integration and availability
- Use cases where Thinking Levels changes the economics
- Google's strategic play: volume over margin
- What this means for the affordable AI model market
- Frequently asked questions
What Gemini 3.1 Flash-Lite is and what it replaces
Gemini 3.1 Flash-Lite occupies a specific position in Google's model lineup: the cheapest, fastest option in the Gemini 3.x family designed for developers who need to run large volumes of requests without the cost profile of a frontier reasoning model.
The model is a successor to Gemini 2.0 Flash-Lite, which Google released in February 2025 as its entry-level production model. That predecessor was already competitive on price, but it offered no control over reasoning depth — every request got the same compute budget regardless of whether the task was extracting a date from a document or solving a multi-step logic problem. Flash-Lite 3.1 changes that constraint with the Thinking Levels feature.
The "3.1" designation is noteworthy. Google is not releasing a 3.0 Flash-Lite first. The 3.1 versioning signals that Thinking Levels is a core architectural capability of the Gemini 3.x generation, not a feature retrofitted onto an existing model. The reasoning depth controls are built into the model's serving infrastructure at a fundamental level, which is why the feature can operate with low latency overhead even at the Minimal setting.
The model launched in preview on March 3, 2026. Preview status means production use is encouraged but Google has not committed to a stability SLA for the Thinking Levels feature specifically. The underlying model inference is production-ready; the adaptive reasoning controls are in a final validation phase.
Thinking Levels explained: four settings, one API parameter
The Thinking Levels feature is the most architecturally interesting part of this release. The concept is straightforward: different tasks require different amounts of reasoning, and burning the same compute budget on every request is economically wasteful.
Google implements four named levels, each corresponding to a distinct compute budget for the model's internal reasoning process before it produces output:
Minimal: The model generates a response with near-zero deliberation overhead. Effectively equivalent to a non-thinking model. Appropriate for tasks where the answer is immediate — entity extraction, classification, direct retrieval, simple templating. Latency is at its lowest. Cost is at floor pricing.
Low: A small reasoning budget is allocated. The model runs a brief internal deliberation before responding. Suitable for single-step reasoning tasks: short summarization, straightforward Q&A where some inference is required, basic code generation for common patterns.
Medium: A moderate reasoning budget. The model considers multiple approaches before selecting a response. Appropriate for multi-step tasks, code debugging, document analysis, and use cases where answer quality matters but latency is still a constraint.
High: Maximum reasoning depth. The model dedicates substantial compute to deliberation before responding. Appropriate for complex reasoning chains, mathematical problem solving, architecture decisions, and tasks where error cost is high and latency tolerance is broad.
The API surface is a single parameter — thinking_level — passed in the request body. No model switching required. One endpoint, one model, four compute profiles. The practical implication is that a single application can use different thinking levels for different request types without managing multiple API clients or routing layers.
{
  "model": "gemini-3.1-flash-lite",
  "thinking_level": "medium",
  "contents": [...]
}
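As a sketch of how an application might assemble these request bodies: the model ID and the thinking_level field follow the preview documentation quoted above, but build_request itself is a hypothetical helper for illustration, not part of any official SDK.

```python
# Hypothetical helper for building Gemini generate-content request bodies.
# The "thinking_level" field and model ID follow the preview docs quoted
# above; the wrapper itself is illustrative, not an official SDK API.

VALID_LEVELS = {"minimal", "low", "medium", "high"}

def build_request(contents, thinking_level="low"):
    """Return a request body dict; defaults to "low", the preview default."""
    if thinking_level not in VALID_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(VALID_LEVELS)}")
    return {
        "model": "gemini-3.1-flash-lite",
        "thinking_level": thinking_level,
        "contents": contents,
    }
```

Validating the level client-side keeps a typo from silently falling back to whatever default the server applies.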
The billing model follows the thinking level. Google charges a multiplier on the base input token price based on the selected level. Minimal is billed at base rate. High is billed at a premium that has not been publicly quantified in the preview documentation but is described as "still significantly below Gemini 3.1 Flash pricing."
Pricing: $0.25/M input tokens in context
$0.25 per million input tokens at the Minimal thinking level is Google's opening bid for the cost-sensitive developer market. To appreciate what that number represents, you need the full pricing context.
The input token price is only half the equation. Output tokens are priced separately. Google has not published the full Gemini 3.1 Flash-Lite output pricing at preview launch, but based on the typical 4:1 ratio applied across the Gemini lineup, an output price in the $1.00 per million tokens range is the expected ceiling.
For most practical workloads, the blended cost per million tokens — weighted by typical input-to-output ratios — will land between $0.40 and $0.60 at the Minimal thinking level. This is competitive with the cheapest models in any major provider's lineup.
The Thinking Levels premium adds cost as reasoning depth increases. Google has described the pricing progression as roughly linear with the compute multiplier: Low is approximately 1.5x base, Medium approximately 3x base, High approximately 6x base. These are preview estimates — Google reserves the right to adjust at GA. The key point is that even at High thinking level, the model is not priced at Gemini 3.1 Flash or Pro territory.
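Those preview multipliers imply a simple per-request input-cost model. A sketch, treating the 1.5x/3x/6x figures as given even though Google reserves the right to adjust them at GA:

```python
# Input-token cost model using the preview figures described above: $0.25
# base rate and the roughly linear level multipliers. Illustrative only;
# output-token pricing is excluded because it has not been published.

BASE_INPUT_PRICE = 0.25  # USD per 1M input tokens at Minimal

LEVEL_MULTIPLIER = {"minimal": 1.0, "low": 1.5, "medium": 3.0, "high": 6.0}

def input_cost_usd(input_tokens, thinking_level="minimal"):
    """Estimated input-token cost in USD for one request at a given level."""
    rate = BASE_INPUT_PRICE * LEVEL_MULTIPLIER[thinking_level]
    return input_tokens / 1_000_000 * rate
```

At these numbers, a 1M-token input costs $0.25 at Minimal and $1.50 at High, consistent with the rough High-level figure quoted later in this piece.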
Speed benchmark: 2.5x faster than Gemini 2.5 Flash
Google claims Gemini 3.1 Flash-Lite runs 2.5x faster than Gemini 2.5 Flash at the Minimal thinking level. Gemini 2.5 Flash is itself a fast model — Google positioned it as a mid-tier option optimized for throughput in enterprise workloads.
A 2.5x speed improvement over a model that was already fast has real consequences for latency-sensitive applications. Real-time voice AI, streaming completions for code editors, and interactive chat interfaces all operate in regimes where the model's time-to-first-token and tokens-per-second rates determine whether the product feels responsive or sluggish.
The speed gains come from two sources: model architecture changes in the 3.1 generation and the serve path optimizations that accompany Thinking Levels. When a request arrives with thinking_level: minimal, the serving infrastructure skips the reasoning compute allocation entirely and routes directly to the generation pathway. There is no thinking warmup, no internal chain-of-thought scaffolding, and no token budget allocated to reasoning traces. The model starts generating immediately.
This is architecturally different from how "thinking" models typically work. In most implementations — including earlier versions of Gemini's thinking capabilities and Anthropic's extended thinking — the model always runs some internal reasoning process; you can only control whether that reasoning is visible. Gemini 3.1 Flash-Lite with Thinking Levels Minimal actually bypasses the reasoning pathway, not just its visibility. That distinction is what produces the speed benchmark.
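A toy sketch of that serve-path distinction, with every name hypothetical and the real infrastructure far more involved: the point is that Minimal never enters the reasoning branch at all, rather than running it and hiding the trace.

```python
# Toy illustration of the serving-path split described above. All function
# names and budget units are hypothetical stand-ins.

def deliberate(prompt, budget):
    """Stand-in for an internal reasoning pass with a compute budget."""
    return f"<{budget} units of reasoning about: {prompt}>"

def generate(prompt, reasoning_trace=None):
    """Stand-in for the generation pathway."""
    prefix = "reasoned" if reasoning_trace else "direct"
    return f"{prefix} answer to: {prompt}"

def handle_request(prompt, thinking_level):
    if thinking_level == "minimal":
        # No reasoning budget allocated: route straight to generation.
        return generate(prompt)
    budget = {"low": 1, "medium": 4, "high": 16}[thinking_level]  # illustrative
    trace = deliberate(prompt, budget)
    return generate(prompt, reasoning_trace=trace)
```

In a visibility-only design, the `minimal` branch would still call `deliberate` and merely discard the trace; here it is skipped entirely, which is where the latency saving comes from.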
At higher thinking levels, the speed advantage over Gemini 2.5 Flash narrows. At High, the model is likely slower than Gemini 2.5 Flash for the same task because it is spending more compute on deliberation. But the quality uplift at High offsets the latency for use cases where correctness matters more than speed.
Competitive pricing table: Haiku, GPT-4o-mini, Flash-Lite
The honest competitive context for Gemini 3.1 Flash-Lite pricing, using Google's preview figures, the expected (not yet published) Flash-Lite output price, and the competitors' published list rates:

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context window |
| --- | --- | --- | --- |
| Gemini 3.1 Flash-Lite (Minimal) | $0.25 | ~$1.00 (expected) | 1M |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K |
| Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 1M |

Two things stand out in this table.
First, Gemini 3.1 Flash-Lite is not the cheapest model Google offers. Gemini 2.0 Flash-Lite and the 1.5 Flash-8B are both cheaper per token at baseline. The $0.25 price point reflects the capability premium from the 3.1 generation architecture and the overhead of the Thinking Levels infrastructure.
Second, GPT-4o-mini is still cheaper on input tokens. At $0.15 per million input tokens, GPT-4o-mini undercuts Flash-Lite 3.1 on raw input price by 40%. OpenAI has positioned GPT-4o-mini as its long-term entry-level workhorse, and the pricing reflects that commitment.
Where Google competes on value rather than nominal price is context window and Thinking Levels. Gemini 3.1 Flash-Lite offers a 1M token context window — 8x larger than GPT-4o-mini's 128K context. For workloads that involve long documents, large codebases, or extended conversations, the effective cost per useful token shifts dramatically in Google's favor because you avoid the chunking, retrieval, and re-summarization overhead that smaller context windows impose.
And Thinking Levels has no direct analogue in the competitive set. Claude Haiku 3.5 does not offer per-request reasoning depth control. GPT-4o-mini does not offer per-request reasoning depth control. If the feature delivers on its premise — meaningful quality improvement at Medium and High without requiring a model upgrade — it changes the value calculation for mixed-workload applications.
Why adjustable reasoning depth is a UX innovation
The conventional mental model for AI model selection is a step function: you pick a model class (small, medium, large, reasoning) and every request to that model gets the same treatment. If you need more quality, you upgrade to a more expensive model. If you need less cost, you downgrade to a cheaper one.
Thinking Levels replaces that step function with a continuous dial — or more precisely, a four-position selector — that operates at the request level rather than the model level.
The UX innovation is not the reasoning capability itself. Extended thinking has been available in various forms since OpenAI's o1 release in September 2024. Anthropic added extended thinking to Claude 3.7 Sonnet in February 2025. The innovation is per-request granularity without model switching.
Consider a customer support automation pipeline. It handles thousands of request types, ranging from "what are your business hours" to "explain why my invoice amount changed after I upgraded my subscription mid-cycle." The first query requires essentially no reasoning. The second requires multi-step inference over billing rules, proration logic, and potentially account-specific history.
Without Thinking Levels, you have two options: run everything through a cheap model (cheap, fast, but wrong on hard cases) or run everything through a reasoning model (accurate, but expensive and slow for trivial cases). With Thinking Levels, you route the business hours query at Minimal and the billing dispute at High, using the same model, same endpoint, same monitoring infrastructure. The routing logic is a simple classification step on the incoming request.
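A sketch of that routing step. The keyword-based classifier is deliberately naive and every name here is hypothetical (a production system would use a trained intent model), but it shows the shape: one classification, one model, one endpoint.

```python
# Hypothetical routing sketch: map incoming support queries to a thinking
# level before calling the single Flash-Lite endpoint. The keyword lists
# stand in for a real intent classifier.

SIMPLE_HINTS = ("business hours", "opening hours", "phone number", "address")
COMPLEX_HINTS = ("invoice", "proration", "refund", "upgraded", "billing")

def pick_thinking_level(query):
    q = query.lower()
    if any(hint in q for hint in COMPLEX_HINTS):
        return "high"      # multi-step inference over billing rules
    if any(hint in q for hint in SIMPLE_HINTS):
        return "minimal"   # direct retrieval, no reasoning needed
    return "low"           # the preview default for everything in between
```

The selected level is then passed as the thinking_level field on an otherwise unchanged request.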
The economic consequence is a reduction in the "wasted reasoning tax" — the compute cost you pay when you run a powerful model on a trivial task. That tax is not trivial. In high-volume pipelines where 70–80% of requests are simple, paying for unnecessary reasoning can account for a significant fraction of AI API spend.
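Back-of-envelope arithmetic for that tax: the traffic mix below is an assumption, and the multipliers are Google's stated preview figures, but the comparison shows how much of an all-High budget is wasted reasoning.

```python
# Illustrative only: the 75/25 traffic mix is assumed; base rate and
# multipliers are the preview figures quoted earlier in this piece.

BASE = 0.25  # USD per 1M input tokens at Minimal
MULT = {"minimal": 1.0, "low": 1.5, "medium": 3.0, "high": 6.0}

def blended_rate(mix):
    """mix: {level: fraction of token volume}. Returns USD per 1M input tokens."""
    return sum(frac * BASE * MULT[level] for level, frac in mix.items())

all_high = blended_rate({"high": 1.0})                      # everything at High
routed = blended_rate({"minimal": 0.75, "high": 0.25})      # routed by complexity
savings = 1 - routed / all_high
```

Under this mix, routing cuts the blended input rate from $1.50 to about $0.56 per million tokens, a 62.5% reduction, which is the "wasted reasoning tax" recovered.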
Developer experience: API integration and availability
Gemini 3.1 Flash-Lite is available through Google AI Studio and the Gemini API as of March 3, 2026. The model ID is gemini-3.1-flash-lite in the Gemini API, and it appears in the Google Cloud Vertex AI model garden under the same identifier.
The thinking_level parameter is a top-level request field in the Gemini API's generate content endpoint. It accepts four string values: "minimal", "low", "medium", "high". An omitted parameter defaults to "low" in the preview — Google chose a non-zero default to ensure developers see the reasoning capability without explicitly opting in.
SDK support is available at launch for the Python and JavaScript/TypeScript SDKs, which cover the majority of developer workloads. Go and Java SDK support follows in the weeks after preview. The REST API works immediately without SDK dependency.
Google AI Studio's playground interface adds a visual Thinking Levels slider to the Flash-Lite configuration panel. This is a deliberate onboarding choice: developers experimenting with the model in the playground can see the output quality and speed change as they adjust the slider, building intuition for when each level is appropriate before committing to a production routing strategy.
Rate limits at preview are set at the same tier as Gemini 2.5 Flash: 1,000 requests per minute for Tier 1 API keys, with higher limits available through quota increase requests on Vertex AI. Google has explicitly stated that the Thinking Levels implementation does not reduce the effective throughput ceiling — High thinking level requests consume more per-request time but do not count against rate limits differently than Minimal requests.
Use cases where Thinking Levels changes the economics
The practical value of per-request reasoning control concentrates in three workload categories.
High-volume classification and extraction pipelines. Document processing, content moderation, data enrichment, and form parsing involve millions of requests that are structurally simple — read this input, identify these fields, return structured JSON. These tasks run most cost-efficiently at Minimal. A pipeline that previously had to choose between a cheap model with occasional errors and a quality model at 3–5x the cost can now run Minimal for the majority of inputs and escalate to Medium or High only when confidence thresholds are not met.
Conversational AI with variable query complexity. Voice assistants, customer service bots, and enterprise chat tools handle a distribution of query types. Simple factual retrieval and status queries dominate by volume; complex troubleshooting and multi-step instructions are rare but high-stakes. A Thinking Levels routing strategy — classify incoming intent, select reasoning depth, execute — aligns compute cost with actual task complexity without changing the user-facing model.
Code generation and review at different quality gates. Autocomplete suggestions during active typing demand minimal latency and can tolerate occasional imprecision — Low is appropriate. A code review pass on a pull request runs less frequently and requires careful reasoning about edge cases — High is appropriate. A developer tool that previously forced you to choose between a fast autocomplete model and a quality review model can now use Flash-Lite for both, reducing API surface area and simplifying cost management.
In each case, the economic improvement comes not from Gemini 3.1 Flash-Lite being cheaper than alternatives on a per-token basis (it is not always cheaper), but from eliminating the forced choice between cost and quality that a fixed reasoning model imposes.
Google's strategic play: volume over margin
Google's pricing and feature decisions for Gemini 3.1 Flash-Lite make more sense when read against the competitive dynamics of the developer AI market in early 2026.
OpenAI's GPT-4o-mini holds the largest installed base among developers building cost-sensitive production applications. It is well-understood, reliable, and priced aggressively. Claude Haiku 3.5 has strong adoption in enterprise pipelines where Anthropic's safety reputation and API reliability matter. Together, these two models account for a disproportionate share of AI API token volume outside the frontier reasoning category.
Google's strategy with Flash-Lite 3.1 is not to undercut on nominal price — GPT-4o-mini remains cheaper on input tokens. The strategy is to offer a differentiating feature (Thinking Levels) plus a context window advantage (1M tokens vs 128K) at a price that is still competitive, and to use developer familiarity with Google AI Studio as a distribution channel.
The deeper play is volume. Google's AI infrastructure — TPUs, the Gemini serving stack, the AI Studio interface — operates at massive scale. Marginal cost of serving tokens decreases as volume increases. By pricing Flash-Lite aggressively and differentiating on features rather than just price, Google is attempting to pull workloads from OpenAI and Anthropic at a price point where Google's infrastructure advantages translate to margin even if the nominal price looks thin.
The $0.25 input token price is also a deliberate psychological anchor. Google's own Gemini 2.0 Flash-Lite is cheaper at $0.075. By pricing 3.1 Flash-Lite higher, Google is signaling that the Thinking Levels capability has real value — it is not a free feature but a priced upgrade from the baseline. This framing makes the Thinking Levels feature feel like a product choice rather than a freebie, which matters for enterprise procurement conversations where "we pay for what we use" is the standard accounting model.
What this means for the affordable AI model market
The small-model market — the segment below frontier reasoning models — is becoming increasingly competitive in early 2026, and Gemini 3.1 Flash-Lite is a meaningful entry in that competition.
The trend driving this competition is clear: as frontier models become commoditized and their capabilities trickle down into smaller models, the cost-sensitive developer segment is growing faster than the frontier segment. More applications are production-ready at smaller model sizes. The product teams that could not afford frontier model inference costs in 2024 are discovering that 2026's small models can handle their workloads at acceptable quality levels.
Per-request reasoning depth control — if Google executes it cleanly and competitors do not rapidly replicate it — gives Flash-Lite a window of differentiation that is meaningful for a year or more. Model capability can be approximated by competitors quickly. Infrastructure features that require deep serving-layer integration take longer to replicate.
The risk for Google is execution. Preview status is a hedge: it signals that Google is not fully committed to the current pricing or feature parameters. If the Thinking Levels feature at High is not materially better than a fixed reasoning model on hard tasks, the differentiation story collapses and Flash-Lite 3.1 becomes an overpriced version of its cheaper predecessor.
The risk for developers is lock-in. Thinking Levels is a Google-specific API parameter. Building routing logic that depends on it creates a dependency on Google's model lineup that increases migration cost if a competitor offers a better alternative. That dependency is manageable — the routing layer can be abstracted — but it is real.
What is not in question is the direction of the market: more capability, lower cost, finer control. Gemini 3.1 Flash-Lite with Thinking Levels is the clearest expression yet of where the affordable AI model market is heading.
Frequently asked questions
What is Gemini 3.1 Flash-Lite?
Gemini 3.1 Flash-Lite is Google's latest entry-level production model in the Gemini 3.x family, released in preview on March 3, 2026. It is designed for high-volume, cost-sensitive workloads and introduces the Thinking Levels feature, which allows developers to control reasoning depth per request. It runs 2.5x faster than Gemini 2.5 Flash at the Minimal thinking level and starts at $0.25 per million input tokens.
What are Thinking Levels and how do they work?
Thinking Levels is a per-request API parameter (thinking_level) that controls how much reasoning compute the model applies before generating a response. Four levels are available: Minimal, Low, Medium, and High. Minimal bypasses the reasoning pathway entirely for maximum speed. High allocates maximum deliberation for complex tasks. Billing scales with the selected level, with Minimal at the base price and higher levels at a premium.
How does Gemini 3.1 Flash-Lite compare to GPT-4o-mini on price?
GPT-4o-mini is cheaper on raw input token price at $0.15/1M tokens versus Flash-Lite's $0.25/1M tokens. However, Flash-Lite offers a 1M token context window versus GPT-4o-mini's 128K context, which changes the effective cost for long-document workloads significantly. Thinking Levels has no equivalent in GPT-4o-mini's feature set.
Is Gemini 3.1 Flash-Lite production-ready?
The model is in preview as of March 3, 2026. Preview status means production use is appropriate, but Google has not committed to a stability SLA for the Thinking Levels feature specifically. Standard Gemini API uptime guarantees apply to the underlying inference endpoint. Google has not announced a general availability date.
Can I use different Thinking Levels in the same application?
Yes. The thinking_level parameter is per-request, not per-application or per-API-key. A single application can send Minimal requests for simple queries and High requests for complex ones using the same model endpoint and API client. No model switching or separate routing infrastructure is required beyond the logic to select the appropriate level per request type.
Does High thinking level make Flash-Lite as capable as Gemini 3.1 Flash or Pro?
No. Thinking Levels controls reasoning depth within a model's capability envelope, not the envelope itself. Gemini 3.1 Flash and Pro are larger models with higher baseline capabilities. At High, Flash-Lite reasons more carefully and produces better outputs than at Minimal, but it does not achieve Flash or Pro quality on tasks that require the larger model's underlying knowledge or capacity.
What happens to cost if I set everything to High thinking level?
Cost increases substantially. Google has described the High level pricing as approximately 6x the base input token rate, which would price High at roughly $1.50 per million input tokens — similar to mid-tier model pricing from other providers. Setting everything to High eliminates the cost efficiency advantage of using Flash-Lite. The economic value of Thinking Levels comes from using the lowest appropriate level for each request type, not from defaulting to maximum reasoning across all requests.