OpenAI’s GPT-5.3-Codex-Spark generates code at over 1,000 tokens per second. That is roughly 15x faster than the parent GPT-5.3-Codex model, fast enough that the model’s output appears on screen as quickly as you can read it. The speed comes from a new hardware partnership: Codex-Spark is OpenAI’s first model running on Cerebras silicon instead of Nvidia GPUs. Backed by a $10 billion multiyear deal between the two companies, this is not a lab experiment. It is a production deployment that signals where AI hardware is heading.
The model launched on February 12, 2026, as a research preview for ChatGPT Pro users. It runs on Cerebras’ Wafer Scale Engine 3 (WSE-3), a single wafer-sized chip carrying 4 trillion transistors and 900,000 AI-optimized cores. For developers, the practical upshot is real-time coding feedback with a 128K context window, enough to hold large codebases in memory while responding fast enough to keep up with interactive editing.
Why Cerebras, and Why Now?
Every major AI lab runs on Nvidia. OpenAI’s own infrastructure is built on hundreds of thousands of Nvidia GPUs. So why partner with a startup making dinner-plate-sized chips in a market Nvidia dominates with over 80% share?
The short answer: latency. Nvidia GPUs use High Bandwidth Memory (HBM), which sits off-chip and communicates through external interconnects. For training massive models, that architecture works. For inference, where each generated token requires streaming model weights through memory and the goal is to serve individual requests as fast as possible, that off-chip memory bottleneck caps how fast a single request can be served.
SRAM vs. HBM: The Technical Gap
Cerebras’ WSE-3 takes a radically different approach. Instead of connecting thousands of discrete GPUs over a network, it builds everything onto a single wafer. The critical difference is memory type: the WSE-3 uses SRAM (Static Random-Access Memory), which sits directly on the chip, roughly 1,000x faster than the HBM4 found on Nvidia’s upcoming Rubin GPUs. No off-chip memory hops, no interconnect overhead.
The numbers make the gap concrete:
| Spec | Cerebras WSE-3 | Nvidia B200 |
|---|---|---|
| Transistors | 4 trillion | ~208 billion |
| AI compute | 125 PFLOPS | ~4.5 PFLOPS |
| Memory type | On-chip SRAM | Off-chip HBM3e |
| Memory bandwidth | ~21 PB/s (~7,000x H100) | 8 TB/s |
| Cores | 900,000 | 18,432 CUDA |
This architecture is purpose-built for inference. OpenAI called the Cerebras integration a “low-latency serving tier” added to their production stack. GPUs continue handling large-scale, cost-efficient training workloads. Cerebras handles the use cases where speed per request matters more than throughput per dollar.
The $10 Billion Bet
In January 2026, OpenAI and Cerebras signed a deal to bring 750 megawatts of Cerebras-backed compute online in phases through 2028. For context, that is more power than many mid-sized cities consume. The deployment makes it the largest high-speed AI inference installation in the world.
This is not OpenAI abandoning Nvidia. It is OpenAI hedging. The WSE-3 handles interactive inference workloads where latency matters. Nvidia GPUs handle training runs and batch inference where cost efficiency matters. Different chips for different jobs, the same way datacenters use both SSDs and spinning disks.
Cerebras is also eyeing a 2026 IPO that could value the company at over $15 billion. The OpenAI partnership is both a technical collaboration and an anchor customer that makes that IPO viable.
What Codex-Spark Actually Does
Codex-Spark is a smaller, faster variant of GPT-5.3-Codex, optimized specifically for interactive coding. Think of it as the difference between a cargo ship and a speedboat: both move things, but one is built for throughput and the other for responsiveness.
Performance Metrics
The raw speed improvements are significant:
- Token generation: 1,000+ tokens per second (vs. ~65 tok/s for the parent model on Nvidia)
- Time-to-first-token: 50% reduction compared to GPT-5.3-Codex
- Client-server roundtrip overhead: 80% reduction
- Per-token processing overhead: 30% reduction
- Context window: 128K tokens (text-only at launch)
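Back-of-envelope arithmetic puts those numbers in perspective. The throughput figures below come from the list above; the 2-second baseline time-to-first-token is an illustrative assumption, not a published spec:

```python
# Wall-clock time to stream a 500-token completion.
# Throughput figures are from the performance list above; the baseline
# time-to-first-token (TTFT) of 2.0 s is an illustrative assumption.

def completion_time(tokens: int, tok_per_sec: float, ttft: float) -> float:
    """Seconds until the last token arrives: TTFT plus streaming time."""
    return ttft + tokens / tok_per_sec

baseline = completion_time(500, tok_per_sec=65, ttft=2.0)    # parent model
spark = completion_time(500, tok_per_sec=1000, ttft=1.0)     # 50% lower TTFT

print(f"GPT-5.3-Codex: {baseline:.1f} s")  # ~9.7 s
print(f"Codex-Spark:   {spark:.1f} s")     # ~1.5 s
print(f"Speedup:       {baseline / spark:.1f}x")
```

Under these assumptions, a medium-sized completion drops from about ten seconds to under two, which is the difference between waiting and not noticing.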
On agentic software engineering benchmarks like SWE-Bench Pro and Terminal-Bench 2.0, Codex-Spark produces more capable responses than GPT-5.1-Codex-mini while completing tasks in a fraction of the time. It does not match the full GPT-5.3-Codex on raw accuracy for the hardest problems, but for the 80% of coding tasks that are routine (refactoring, bug fixes, test generation, boilerplate), speed matters more than marginal accuracy gains.
How It Fits into the Codex Family
OpenAI now has a three-tier coding model stack:
- GPT-5.3-Codex: The full-power model. Strongest on complex, multi-step tasks. Runs on Nvidia GPUs. 256K context. 77.3% on Terminal-Bench 2.0.
- GPT-5.3-Codex-Spark: The speed variant. Real-time interactive coding. Runs on Cerebras WSE-3. 128K context. ~15x faster inference.
- GPT-5.1-Codex-mini: The lightweight tier. Fast and cheap, lower capability ceiling.
The practical workflow OpenAI envisions: use Spark for real-time editing sessions where you need instant feedback, then hand off complex architectural problems to the full Codex model. The Codex app, CLI, and VS Code extension already support both models, so switching is a model selection toggle, not a workflow change.
The WebSocket-based connection that Spark uses by default also matters for developer experience. Traditional HTTP request-response cycles add latency on every turn. Persistent WebSocket connections keep the pipe open, which is how OpenAI achieves that 80% reduction in roundtrip overhead. For interactive coding, where you might send dozens of small prompts per minute, that adds up fast.
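The effect of connection reuse is easy to quantify. This sketch applies the 80% overhead reduction cited above to an assumed 250 ms of per-request HTTP setup cost (the 250 ms figure and the 40-turn session are illustrative assumptions):

```python
# Why connection reuse matters for chatty, interactive sessions.
# The 80% roundtrip-overhead reduction is from OpenAI's figures above;
# the 250 ms per-request HTTP overhead is an illustrative assumption.

HTTP_OVERHEAD_S = 0.250                  # assumed: connection + request setup
WS_OVERHEAD_S = HTTP_OVERHEAD_S * 0.2    # 80% reduction on a persistent socket

def session_overhead(turns: int, per_turn: float) -> float:
    """Total protocol overhead across an interactive session, in seconds."""
    return turns * per_turn

turns = 40  # e.g. dozens of small prompts over a few minutes
print(f"HTTP per-request: {session_overhead(turns, HTTP_OVERHEAD_S):.1f} s of overhead")
print(f"WebSocket:        {session_overhead(turns, WS_OVERHEAD_S):.1f} s of overhead")
```

Under these assumptions, a 40-turn session spends ten seconds on pure protocol overhead with per-request HTTP, versus two seconds on a persistent socket. None of that is model compute; it is all waiting the user feels.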
What This Means for the AI Chip Market
The Nvidia monoculture in AI is something everyone acknowledges and nobody has been able to change. Nvidia holds over 80% of the AI accelerator market. Every hyperscaler, every AI lab, every startup builds on CUDA. The network effects are enormous: CUDA is a 20-year ecosystem of libraries, tooling, and institutional knowledge.
OpenAI choosing Cerebras for a production deployment does not break that monopoly. But it proves the monopoly has cracks.
The Inference Divergence
Training and inference are splitting into distinct hardware markets. Training requires massive parallelism across thousands of GPUs, long-running jobs where cost per FLOP matters most. Inference requires low latency on individual requests, short bursts where time-to-response matters most.
Nvidia optimized for training first and adapted those GPUs for inference. Cerebras designed for inference from the ground up. The WSE-3’s on-chip SRAM eliminates the memory wall that GPU-based inference hits, which is why it can serve tokens at 1,000/second where a GPU cluster tops out around 65.
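The memory wall can be made concrete with the standard bandwidth-bound estimate: at batch size 1, each decoded token must stream every active weight through memory once, so tokens per second cannot exceed bandwidth divided by model size. The 70 GB model size here is an illustrative assumption, not a Codex spec, and real systems batch requests and cache KV state, so these are loose ceilings, not predictions:

```python
# Rough bandwidth-bound decode ceiling at batch size 1:
#   tokens/s <= memory_bandwidth / model_bytes
# The 70 GB of weights (e.g. ~70B params at 8-bit) is an assumption.

GB = 1e9
TB = 1e12

model_bytes = 70 * GB         # assumed model size, for illustration only

hbm_bandwidth = 8 * TB        # B200-class off-chip HBM3e (table above)
sram_bandwidth = 21_000 * TB  # WSE-3 aggregate on-chip SRAM (~21 PB/s)

print(f"HBM ceiling:  {hbm_bandwidth / model_bytes:,.0f} tok/s")   # ~114
print(f"SRAM ceiling: {sram_bandwidth / model_bytes:,.0f} tok/s")
```

Under these assumptions, the observed ~65 tok/s on GPUs sits comfortably under the ~114 tok/s HBM ceiling, while the SRAM ceiling is orders of magnitude above what any single request needs. That is the gap Cerebras is selling.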
This split has implications beyond OpenAI. If the industry moves toward specialized inference hardware, the market gets bigger and more competitive. Amazon is building Trainium chips, Google has TPUs, Microsoft is developing Maia. Each is optimized for different points in the training-inference spectrum. Cerebras occupies the extreme low-latency end.
Developer Impact
For developers using OpenAI’s models through the API or ChatGPT, the hardware behind the curtain is invisible. You call an endpoint, you get tokens back faster. The relevance is in what it enables: coding workflows that were too slow before become viable.
Real-time pair programming with AI has been limited by latency. When a model takes 3-5 seconds to start generating a response, the flow state breaks. At 1,000 tokens per second with 50% faster first-token delivery, the model generates output roughly as fast as a quick reader can take it in. That changes the interaction pattern from “prompt, wait, review” to something closer to collaborative typing.
How to Access Codex-Spark
Codex-Spark is currently a research preview with limited access:
- Who can use it: ChatGPT Pro subscribers ($200/month)
- Where: Codex app (macOS), CLI, VS Code extension
- Rate limits: Separate from standard ChatGPT limits (usage does not count against your regular quota)
- API access: Limited to design partners; staged rollout planned
- Input: Text-only at launch (no multimodal support yet)
OpenAI has not announced pricing for API access. Given that Cerebras hardware likely costs more per chip than Nvidia GPUs, expect a premium tier, possibly as a latency-optimized option alongside the standard GPU-served models.
During the research preview, expect occasional queuing during high-demand periods. OpenAI and Cerebras are still ramping datacenter capacity as part of their phased 750-megawatt deployment.
Safety Notes
OpenAI states that Codex-Spark includes the same safety training as their mainline models, including cyber-related safeguards. Notably, it does not meet the thresholds for high-risk capabilities in cybersecurity or biology, unlike the parent GPT-5.3-Codex model which was classified as “High capability” for cybersecurity. The smaller model size likely accounts for this difference.
Frequently Asked Questions
What is GPT-5.3-Codex-Spark?
GPT-5.3-Codex-Spark is a smaller, faster variant of OpenAI’s GPT-5.3-Codex coding model, optimized for real-time interactive coding. It runs on Cerebras’ Wafer Scale Engine 3 instead of Nvidia GPUs and generates over 1,000 tokens per second, roughly 15x faster than its parent model.
How fast is GPT-5.3-Codex-Spark compared to regular Codex?
Codex-Spark generates over 1,000 tokens per second compared to roughly 65 tokens per second for GPT-5.3-Codex on Nvidia GPUs. It also reduces time-to-first-token by 50% and client-server roundtrip overhead by 80%.
Why does OpenAI use Cerebras chips instead of Nvidia for Codex-Spark?
Cerebras’ WSE-3 uses on-chip SRAM memory that is roughly 1,000x faster than the HBM used in Nvidia GPUs. This eliminates the memory bottleneck that limits inference speed on GPUs. For real-time coding where latency matters more than throughput, the Cerebras architecture is better suited.
Who can access GPT-5.3-Codex-Spark?
Codex-Spark is currently a research preview available to ChatGPT Pro subscribers ($200/month). It can be accessed through the Codex app on macOS, the CLI, and the VS Code extension. API access is limited to design partners with a staged rollout planned.
What is the Cerebras Wafer Scale Engine 3?
The WSE-3 is Cerebras’ third-generation AI chip, built on a single wafer containing 4 trillion transistors and 900,000 AI-optimized cores. It delivers 125 petaflops of compute and uses on-chip SRAM instead of off-chip HBM, enabling significantly faster inference than GPU-based systems.
