OpenAI Introduces GPT-5.5: The Smartest and Most Efficient Model the Company Has Ever Built
Artificial Intelligence just reached a new level with the launch of GPT-5.5, OpenAI’s latest model.
And we’re not talking about one of those incremental updates you barely notice in your day-to-day.
This time, the change is real, measurable, and already impacting how engineers, scientists, and professionals across different fields work with computers.
GPT-5.5 arrived with a very different pitch compared to previous releases: being smarter without sacrificing speed, something that has historically been a tough trade-off in the development of large language models. OpenAI says GPT-5.5 matches the per-token latency of GPT-5.4 in real-world production environments, even though it’s a significantly more capable model. On top of that, it uses fewer tokens to complete the same tasks, making it not only more powerful but also more efficient.
Available to Plus, Pro, Business, and Enterprise users on ChatGPT and Codex, the model is already in production. GPT-5.5 Pro, a higher-precision variant, is available for Pro, Business, and Enterprise users. Both versions were also released through the API starting April 24, 2026, along with an updated system card detailing the additional safeguards applied.
From autonomously resolving real GitHub issues to breakthroughs in cutting-edge scientific research, including a new proof about Ramsey numbers in the field of combinatorics, GPT-5.5 seems to be delivering on a promise OpenAI has been building toward for quite some time.
But what exactly changed? What do these results mean in practice? And why are so many people inside and outside OpenAI describing this launch as an inflection point? 🚀
That’s what we’re going to break down here.
What Makes GPT-5.5 Different From Previous Models
To understand the real impact of GPT-5.5, it helps to look at what previous OpenAI models could do and where they stumbled. GPT-5.4 was already a solid model, but it had clear limitations in maintaining consistency across long tasks, handling chained instructions without losing the thread, and especially acting autonomously within real development environments and knowledge work.
GPT-5.5 isn’t just a faster or cheaper version. It represents a shift in how the model processes and executes high-complexity tasks. According to OpenAI, instead of needing to carefully manage every step, you can now hand GPT-5.5 a complex, messy task and trust that it will plan, use tools, verify its own work, navigate ambiguity, and keep going until the task is done.
The difference that developers running GPT-5.5 in production mention most often is how it handles ambiguous instructions and incomplete context. While earlier versions tended to fill in gaps in generic ways, GPT-5.5 shows a greater ability to identify ambiguity before acting on it. Dan Shipper, founder and CEO of Every, described GPT-5.5 as the first coding model he has used with true conceptual clarity. He tested the model by reproducing a real scenario: after spending days trying to solve a post-launch bug with one of his best engineers, he went back to the original state of the problem and asked GPT-5.5 to analyze it. GPT-5.4 couldn’t solve it. GPT-5.5 arrived at the same solution the human engineer had implemented.
Another key point is GPT-5.5’s orientation toward agency. This means it wasn’t optimized just to answer questions well, but to make decisions within chained workflows. The gains are especially strong in agentic coding, computer use, knowledge work, and early-stage scientific research — areas where progress depends on contextual reasoning and sustained action over time.
Codex and GPT-5.5: The Duo Transforming Software Development
Codex running on GPT-5.5 is a completely different experience from previous versions. What used to be a smart code-autocomplete tool now works as an engineering agent capable of reading entire repositories, understanding project architecture, identifying problems, and proposing solutions aligned with the existing codebase’s patterns.
The coding benchmarks are impressive and concrete:
- Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination: GPT-5.5 hit 82.7% accuracy, state of the art, compared to 75.1% for GPT-5.4 and 69.4% for Claude Opus 4.7.
- SWE-Bench Pro, which evaluates resolution of real GitHub issues: GPT-5.5 reached 58.6%, solving more end-to-end tasks in a single pass than previous models.
- Expert-SWE, an internal OpenAI evaluation for long-duration coding tasks with an estimated average human completion time of 20 hours: GPT-5.5 scored 73.1% versus 68.5% for GPT-5.4.
And across all three of these benchmarks, GPT-5.5 improved on GPT-5.4’s scores while using fewer tokens. That’s a rare combination: smarter and more economical at the same time.
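To put those gains in perspective, here is a minimal sketch that turns the two head-to-head scores quoted above into a percentage-point gain and a relative drop in failed tasks. The function name `gains` is just an illustrative choice, and SWE-Bench Pro is left out because no GPT-5.4 score is quoted for it here.

```python
# Scores in percent, as quoted in the list above: (GPT-5.5, GPT-5.4).
scores = {
    "Terminal-Bench 2.0": (82.7, 75.1),
    "Expert-SWE": (73.1, 68.5),
}

def gains(new, old):
    """Return (percentage-point gain, relative reduction in failed tasks in %)."""
    point_gain = new - old
    failures_before = 100 - old
    failures_after = 100 - new
    failure_reduction = (failures_before - failures_after) / failures_before * 100
    return round(point_gain, 1), round(failure_reduction, 1)

for name, (new, old) in scores.items():
    points, fewer_failures = gains(new, old)
    print(f"{name}: +{points} points, {fewer_failures}% fewer failed tasks")
```

Read this way, a 7.6-point gain on Terminal-Bench 2.0 corresponds to roughly 30% fewer failed tasks, which is often a more intuitive measure once scores get close to the ceiling.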
In practice, developers using Codex in production report a real change in their work pace. Pietro Schirano, CEO of MagicPath, described a case where GPT-5.5 merged a branch with hundreds of frontend changes and refactoring into a main branch that had also changed substantially, resolving everything at once in about 20 minutes.
Senior engineers who tested the model said GPT-5.5 was remarkably stronger than GPT-5.4 and Claude Opus 4.7 in reasoning and autonomy, catching problems early and anticipating testing and review needs without being explicitly asked. In one case, an engineer asked the model to re-architect a commenting system in a collaborative markdown editor and came back to find a stack of 12 diffs practically complete.
An NVIDIA engineer with early access to the model was even more emphatic, saying that losing access to GPT-5.5 is like having a limb amputated.
Michael Truell, co-founder and CEO of Cursor, summed it up: GPT-5.5 is remarkably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping prematurely, which matters especially for the complex, extended work that Cursor users delegate to the model.
Knowledge Work: Documents, Spreadsheets, and Real Computer Use
The same capabilities that make GPT-5.5 excellent at coding also make it powerful for everyday computer work. Because the model is better at understanding user intent, it can navigate the full knowledge work cycle more naturally: finding information, understanding what matters, using tools, verifying output, and transforming raw material into something useful.
In Codex, GPT-5.5 outperforms GPT-5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers said it surpassed previous models in tasks like operations research, spreadsheet modeling, and transforming messy business inputs into structured plans.
Some internal examples from OpenAI itself show the breadth of this capability:
- The Communications team used GPT-5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent.
- The Finance team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, accelerating the task by two weeks compared to the previous year.
- On the Go-to-Market team, an employee automated weekly business report generation, saving 5 to 10 hours per week.
Today, more than 85% of OpenAI employees use Codex every week, spanning roles from software engineering to finance, communications, marketing, data science, and product management.
On professional work benchmarks, the numbers back up this performance:
- GDPval, which tests agents on knowledge work across 44 occupations: GPT-5.5 scored 84.9%.
- OSWorld-Verified, which measures whether a model can operate real computer environments autonomously: 78.7%.
- Tau2-bench Telecom, for complex customer service workflows: 98.0% without prompt tuning.
- FinanceAgent: 60.0%.
- Internal investment banking modeling tasks: 88.5%.
GPT-5.5 in Scientific Research: When AI Starts Discovering What Humans Haven’t Seen Yet
If GPT-5.5’s impact on software development is already impressive, what it’s doing in scientific research is on another level entirely. GPT-5.5 shows clear gains in scientific and technical research workflows that demand more than just answering a hard question. Researchers need to explore an idea, gather evidence, test assumptions, interpret results, and decide what to try next. GPT-5.5 is better at persisting through that cycle than other models.
In terms of scientific benchmarks, the results are significant:
- On GeneBench, a new evaluation focused on scientific data analysis in genetics and quantitative biology, GPT-5.5 scored 25.0% versus 19.0% for GPT-5.4. GPT-5.5 Pro reached 33.2%.
- On BixBench, a real-world bioinformatics and data analysis benchmark, GPT-5.5 reached 80.5%, leading among models with published scores.
- On FrontierMath Tier 4, the hardest math problems, GPT-5.5 hit 35.4% compared to 27.1% for GPT-5.4 and 22.9% for Claude Opus 4.7.
But the most surprising example might be its direct contribution to pure mathematics. An internal version of GPT-5.5 with a custom harness helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics. Ramsey numbers ask, roughly speaking, how large a network needs to be before some kind of order is guaranteed. The proof found by GPT-5.5 concerned a long-standing asymptotic fact about off-diagonal Ramsey numbers, and it was subsequently verified in Lean. We’re not talking about code or explanation here, but a surprising and useful mathematical argument in a core area of research.
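To make the idea of a Ramsey number concrete, here is a toy brute-force check of the smallest classic case, R(3,3) = 6: color every connection between 6 people with one of two colors, and a one-colored triangle always appears, while 5 people are not enough. This is only an illustration of the concept, not the off-diagonal asymptotic result described above, and the function names are our own.

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """coloring maps each edge (i, j) with i < j of the complete graph K_n to 0 or 1."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n):
    """Try all 2-colorings of K_n's edges; True if none avoids a one-colored triangle."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring with no one-colored triangle
    return True

print(every_coloring_has_mono_triangle(5))  # False: 5 vertices are not enough
print(every_coloring_has_mono_triangle(6))  # True: with 6, order is guaranteed
```

Brute force works here because K_6 has only 15 edges (32,768 colorings). The search space doubles with every added edge, which is why results about larger Ramsey numbers need real mathematical arguments rather than enumeration.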
Derya Unutmaz, professor of immunology and researcher at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a gene expression dataset with 62 samples and nearly 28,000 genes. The model produced a detailed research report that not only summarized the findings but also identified key questions and insights that, according to him, would have taken his team months.
Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, used GPT-5.5 in Codex to build an algebraic geometry application from a single prompt in 11 minutes, visualizing the intersection of quadric surfaces and converting the resulting curve into a Weierstrass model.
Safety and Safeguards: The Most Rigorous Level Yet
OpenAI stated that it is launching GPT-5.5 with the strongest set of safeguards it has ever implemented. The model was evaluated across the company’s full suite of safety and preparedness frameworks, underwent internal and external red teaming and targeted testing for advanced cybersecurity and biology capabilities, and received feedback from nearly 200 early access partners before launch.
GPT-5.5’s biological/chemical and cybersecurity capabilities are rated High under OpenAI’s Preparedness Framework, though they did not reach the Critical level.
In practical cybersecurity terms, OpenAI is taking a three-pronged approach:
- Strengthened safeguards: tighter controls around high-risk activities and sensitive cyber requests, plus additional protections against repeated misuse. OpenAI acknowledges that some users may find the stricter classifiers inconvenient at first while they are refined over time.
- Expanded access for cyber defense: models with cyber permissions are being made available through the Trusted Access for Cyber program, starting with Codex. Organizations responsible for defending critical infrastructure can request access to models like GPT-5.4-Cyber.
- Government partnerships: OpenAI is exploring how advanced AI can support the defensive work of officials responsible for critical systems, from tax data to power grids and water supply.
On the CyberGym benchmark, GPT-5.5 scored 81.8% compared to 79.0% for GPT-5.4 and 73.1% for Claude Opus 4.7. In internal Capture-the-Flag challenges, the model reached 88.1%.
Infrastructure and Efficiency: How to Serve a Bigger Model Without Getting Slower
Serving GPT-5.5 at the same latency as GPT-5.4 required rethinking inference as an integrated system, not a collection of isolated optimizations. GPT-5.5 was co-designed, trained, and served on NVIDIA GB200 and GB300 NVL72 systems. And in an especially interesting detail, Codex and GPT-5.5 themselves were instrumental in hitting the performance targets.
Codex helped the team move faster from idea to testable implementation, drafting approaches, connecting experiments, and helping identify which optimizations were worth deeper investment. GPT-5.5, in turn, helped find and implement key improvements in the inference stack itself. In other words: the model helped improve the infrastructure that serves it.
One of these improvements was in load balancing and partitioning heuristics. Before GPT-5.5, requests on an accelerator were split into a fixed number of chunks. Codex analyzed weeks of production traffic patterns and wrote custom heuristic algorithms to partition and balance the workload in an optimized way. The impact was significant: more than a 20% increase in token generation speed.
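OpenAI has not published those heuristics, so the following is only a hypothetical sketch of why a fixed chunk count can leave performance on the table. It assumes a toy batch of requests with made-up token counts, splits them either into a fixed number of chunks or into size-capped chunks, and hands the chunks to the same greedy balancer. All names and numbers are illustrative assumptions.

```python
import heapq

def fixed_count_chunks(request_tokens, num_chunks):
    """Split every request into the same number of near-equal chunks."""
    chunks = []
    for tokens in request_tokens:
        base, extra = divmod(tokens, num_chunks)
        chunks += [base + (1 if i < extra else 0) for i in range(num_chunks)]
    return [c for c in chunks if c > 0]

def size_capped_chunks(request_tokens, max_size):
    """Split every request into chunks of at most max_size tokens."""
    chunks = []
    for tokens in request_tokens:
        full, rest = divmod(tokens, max_size)
        chunks += [max_size] * full + ([rest] if rest else [])
    return chunks

def greedy_makespan(chunks, num_cores):
    """Give the largest remaining chunk to the least-loaded core; return the busiest load."""
    loads = [0] * num_cores
    heapq.heapify(loads)
    for chunk in sorted(chunks, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + chunk)
    return max(loads)

# Made-up traffic shape: one long request next to several short ones.
requests = [4000, 300, 250, 120]
cores = 4

print(greedy_makespan(fixed_count_chunks(requests, 2), cores))    # 2000
print(greedy_makespan(size_capped_chunks(requests, 512), cores))  # 1274
```

With a fixed two-way split, the long request still produces two 2,000-token chunks that dominate the busiest core. Capping chunk size instead lets the balancer spread that work out, bringing the busiest core much closer to the ideal of about 1,168 tokens per core in this toy case.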
On the Artificial Analysis Coding Index, GPT-5.5 delivers state-of-the-art intelligence at half the cost of competing frontier coding models.
Long Context and Abstract Reasoning: Major Improvements
GPT-5.5 shows especially significant gains on long-context tasks. On the OpenAI MRCR v2 8-needle benchmark in the 512K to 1M token range, the model hit 74.0% compared to just 36.6% for GPT-5.4 and 32.2% for Claude Opus 4.7. That’s a dramatic improvement that makes the model far more reliable for working with large codebases and extensive documents.
In abstract reasoning, the results on ARC-AGI-2 Verified are also impressive: 85.0% for GPT-5.5 versus 73.3% for GPT-5.4. This benchmark is considered one of the most rigorous tests of generalization and reasoning capability for AI models.
Availability and Pricing
For API developers, gpt-5.5 is available through the Responses and Chat Completions APIs at $5 per million input tokens and $30 per million output tokens, with a 1 million token context window. Batch and Flex processing are available at half the standard rate, while Priority processing costs 2.5x the standard rate.
gpt-5.5-pro is offered through the API at $30 per million input tokens and $180 per million output tokens, for tasks requiring higher precision.
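For a feel of what those rates mean per request, here is a small cost estimator built only from the prices quoted above. The function name and the example token counts are illustrative assumptions, and applying the Batch, Flex, and Priority multipliers to gpt-5.5-pro is also an assumption, since the figures above state them for gpt-5.5.

```python
# USD per million tokens, as quoted above: (input, output).
PRICES = {
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

# Multipliers on the standard rate, as quoted above for gpt-5.5.
# Whether they also apply to gpt-5.5-pro is an assumption.
TIER_MULTIPLIER = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.5}

def estimate_cost(model, input_tokens, output_tokens, tier="standard"):
    """Estimate the cost in USD of one request at the quoted per-token rates."""
    input_price, output_price = PRICES[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return round(cost * TIER_MULTIPLIER[tier], 4)

# Hypothetical request: 200K tokens of context in, 20K tokens out.
print(estimate_cost("gpt-5.5", 200_000, 20_000))              # 1.6
print(estimate_cost("gpt-5.5", 200_000, 20_000, "batch"))     # 0.8
print(estimate_cost("gpt-5.5", 200_000, 20_000, "priority"))  # 4.0
print(estimate_cost("gpt-5.5-pro", 200_000, 20_000))          # 9.6
```

Because output tokens cost six times as much as input tokens, a model that finishes the same task with fewer generated tokens can offset part of a higher list price.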
In Codex, GPT-5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. It’s also available in Fast mode, generating tokens 1.5x faster at 2.5x the cost.
While GPT-5.5 is priced higher than GPT-5.4, OpenAI points out that it is both smarter and much more token-efficient. In Codex, the experience was tuned so that GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users.
What These Results Actually Mean in Practice
It’s impossible to talk about GPT-5.5 without going through the benchmarks, but it’s also important not to treat those numbers as the whole story. Benchmarks measure what they were designed to measure, and they don’t always capture the nuances that make a difference in real-world use. A model can perform impressively in controlled tests and still fail in frustrating ways when dropped into a real project with ambiguous requirements and a legacy codebase full of accumulated complexity.
What sets GPT-5.5 apart in this context isn’t just test scores, but the accounts from real users who are running the model in production and describing a qualitatively different experience. When engineers, researchers, and professionals across different fields start saying they’ve changed the way they work because of a tool, that’s a stronger signal than any single benchmark number.
Justin Boitano, VP of Enterprise AI at NVIDIA, put it this way: GPT-5.5 delivers the sustained performance needed for heavy execution work. It’s more than faster coding. It’s a new way of working that helps people operate at a fundamentally different speed.
GPT-5.5 wasn’t designed to be OpenAI’s final model. It’s part of a deliberate development trajectory, with each release serving as a foundation for the next. Understanding GPT-5.5 not just as a product, but as a milestone in a continuous line of evolution, is the most honest way to interpret what this launch represents for the future of Artificial Intelligence. 🤖
