Introducing Codex


Today we’re launching a research preview of Codex: a cloud-based software engineering agent that can work on many tasks in parallel. Codex can perform tasks for you such as writing features, answering questions about your codebase, fixing bugs, and proposing pull requests for review; each task runs in its own cloud sandbox environment, preloaded with your repository.

Codex is powered by codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result. We’re starting to roll out Codex to ChatGPT Pro, Enterprise, and Team users today, with support for Plus and Edu coming soon.

How Codex works

Today you can access Codex through the sidebar in ChatGPT and assign it new coding tasks by typing a prompt and clicking “Code”. If you want to ask Codex a question about your codebase, click “Ask”. Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses, linters, and type checkers. Task completion typically takes between 1 and 30 minutes, depending on complexity, and you can monitor Codex’s progress in real time.

Once Codex completes a task, it commits its changes in its environment. Codex provides verifiable evidence of its actions through citations of terminal logs and test outputs, allowing you to trace each step taken during task completion. You can then review the results, request further revisions, open a GitHub pull request, or directly integrate the changes into your local environment. In the product, you can configure the Codex environment to match your real development environment as closely as possible.

Codex can be guided by AGENTS.md files placed within your repository. These are text files, akin to README.md, where you can inform Codex how to navigate your codebase, which commands to run for testing, and how best to adhere to your project's standard practices. Like human developers, Codex agents perform best when provided with configured dev environments, reliable testing setups, and clear documentation. 
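For illustration only, here is a minimal sketch of what an AGENTS.md might contain. The layout, commands, and section names below are hypothetical and will differ for every project; Codex reads the file as free-form guidance rather than a fixed schema.

```markdown
# AGENTS.md (hypothetical example)

## Project layout
- Application code lives in `src/`, tests in `tests/`.

## How to run checks
- Run the test suite with `pytest -q`.
- Lint with `ruff check src tests` and type-check with `mypy src`.

## Conventions
- Follow the existing code style; keep functions small and add docstrings.
- Update or add tests for any behavior you change.
```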

On coding evaluations and internal benchmarks, codex-1 shows strong performance even without AGENTS.md files or custom scaffolding.

 
[Chart: SWE-Bench Verified accuracy (pass@k) versus number of attempts (1, 2, 4, 8) for codex-1 and o3-high.]

[Chart: Accuracy (%) on OpenAI internal SWE tasks: o1-high 11%, o4-mini-high 67%, o3-high 70%, codex-1 75%.]

23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded. codex-1 was tested at a maximum context length of 192k tokens and medium ‘reasoning effort’, which is the setting available in the product today. For details on o3 evaluations, see here.

Our internal SWE task benchmark is a curated set of real-world internal SWE tasks at OpenAI.

Building safe and trustworthy agents

We're releasing Codex as a research preview, in line with our iterative deployment strategy. We prioritized security and transparency when designing Codex so users can verify its outputs, a safeguard that grows increasingly important as AI models handle more complex coding tasks independently and safety considerations evolve. Users can check Codex’s work through citations, terminal logs, and test results. When uncertain or faced with test failures, the Codex agent explicitly communicates these issues, enabling users to make informed decisions about how to proceed. It remains essential for users to manually review and validate all agent-generated code before integration and execution.

[Screenshot: Codex code review with a test-file overlay verifying quoted filenames, plus a summary and passing tests.]
[Screenshot: Codex code review with a terminal overlay showing a passing test for quoted filenames, alongside the summary and diff of the “Fix /diff error with special characters” change.]

Aligning to human preferences

A primary goal while training codex-1 was to align outputs closely with human coding preferences and standards. Compared to OpenAI o3, codex-1 consistently produces cleaner patches ready for immediate human review and integration into standard workflows.

Prompt:

Please fix the following issue in the astropy/astropy repository. Please resolve the issue in the problem below by editing and testing code files in your current code execution session. The repository is cloned in the /testbed folder. You must fully solve the problem for your answer to be considered correct.

Problem statement: Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels

Consider the following model:

```python
from astropy.modeling import models as m
from astropy.modeling.separable import separability_matrix

cm = m.Linear1D(10) & m.Linear1D(5)
```

Its separability matrix, as you might expect, is a diagonal:

```python
>>> separability_matrix(cm)
array([[ True, False],
       [False,  True]])
```

If I make the model more complex:

```python
>>> separability_matrix(m.Pix2Sky_TAN() & m.Linear1D(10) & m.Linear1D(5))
array([[ True,  True, False, False],
       [ True,  True, False, False],
       [False, False,  True, False],
       [False, False, False,  True]])
```

The output matrix is again as expected: the outputs and inputs to the linear models are separable and independent of each other.

If however, I nest these compound models:

```python
>>> separability_matrix(m.Pix2Sky_TAN() & cm)
array([[ True,  True, False, False],
       [ True,  True, False, False],
       [False, False,  True,  True],
       [False, False,  True,  True]])
```

Suddenly the inputs and outputs are no longer separable? This feels like a bug to me, but I might be missing something?
Codex
astropy/modeling/separable.py
```diff
@@ -242,7 +242,7 @@ def _cstack(left, right):
         cright = _coord_matrix(right, 'right', noutp)
     else:
         cright = np.zeros((noutp, right.shape[1]))
-        cright[-right.shape[0]:, -right.shape[1]:] = 1
+        cright[-right.shape[0]:, -right.shape[1]:] = right
 
     return np.hstack([cleft, cright])
```
astropy/modeling/tests/test_separable.py
```diff
@@ -148,3 +148,17 @@ def test_custom_model_separable():
 
     assert not model_c().separable
     assert np.all(separability_matrix(model_c()) == [True, True])
+
+
+def test_nested_compound_model_separability():
+    """Regression test for nested compound models in separability_matrix."""
+    lin_compound = models.Linear1D(10) & models.Linear1D(5)
+    nested = models.Pix2Sky_TAN() & lin_compound
+    expected = np.array([
+        [True, True, False, False],
+        [True, True, False, False],
+        [False, False, True, False],
+        [False, False, False, True],
+    ])
+    result = separability_matrix(nested)
+    assert_allclose(result, expected)
```
OpenAI o3
astropy/modeling/separable.py
```diff
@@ -241,8 +241,26 @@ def _cstack(left, right):
     if isinstance(right, Model):
         cright = _coord_matrix(right, 'right', noutp)
     else:
+        # Insert the separability information contained in the right-hand
+        # operand into the bottom-right corner of the combined matrix.
+        #
+        # Prior to this change the implementation *discarded* the actual
+        # separability matrix of the right operand and replaced it with an
+        # array of ones. This effectively marked every output that originated
+        # from the right-hand side as depending on *all* of its inputs, i.e.
+        # it treated the nested model as completely non-separable. When the
+        # right operand was itself a compound model this led to overly
+        # conservative (and therefore incorrect) separability matrices—for
+        # example when combining an already analysed compound model with the
+        # ``&`` operator, as reported in gh-issue #xxxx (see regression tests
+        # added alongside this fix).
+        #
+        # The correct behaviour is analogous to the handling of the *left*
+        # array operand a few lines above: we should embed the pre-computed
+        # separability matrix of the right operand into the appropriate slice
+        # of the result without altering it.
         cright = np.zeros((noutp, right.shape[1]))
-        cright[-right.shape[0]:, -right.shape[1]:] = 1
+        cright[-right.shape[0]:, -right.shape[1]:] = right
 
     return np.hstack([cleft, cright])
```
astropy/modeling/tests/test_separability_matrix.py
```diff
@@ -0,0 +1,52 @@
+"""Tests for the separability_matrix utility.
+
+This file adds a regression test for the bug discussed in
+astropy/astropy#14966 (see also the accompanying change in
+``astropy.modeling.separable._cstack``).
+
+The failure was triggered when computing the separability matrix for a
+compound model where the right-hand side of an ``&`` operation was *itself*
+another compound model. In that case the implementation treated the nested
+model as completely non-separable, resulting in a matrix that incorrectly
+reported dependencies between all of its outputs and inputs.
+
+The test below reproduces the example from the original bug report and makes
+sure that the corrected implementation returns the expected block-diagonal
+matrix.
+"""
+
+import numpy as np
+
+
+from astropy.modeling import models as m
+from astropy.modeling.separable import separability_matrix
+
+
+def test_nested_compound_model_separability():
+    """Ensure separability_matrix works for nested compound models."""
+
+    # A simple separable compound model consisting of two independent 1-D
+    # linear transformations.
+    cm = m.Linear1D(10) & m.Linear1D(5)
+
+    # Combine the separable compound model with a *non-separable* two-dimensional
+    # TAN projection. The projection goes on the left-hand side of the ``&``
+    # operator while the nested compound model ``cm`` is on the right.
+    nested = m.Pix2Sky_TAN() & cm
+
+    # The expected correlation matrix is block-diagonal: the two outputs of
+    # the TAN projection depend on *both* of its inputs, while the two outputs
+    # of the nested linear model each depend on *only one* of their inputs.
+    expected = np.array(
+        [
+            [True, True, False, False],
+            [True, True, False, False],
+            [False, False, True, False],
+            [False, False, False, True],
+        ]
+    )
+
+    calculated = separability_matrix(nested)
+
+    assert calculated.shape == expected.shape
+    assert np.array_equal(calculated, expected)
```

Preventing abuse

Safeguarding against malicious applications of AI-driven software engineering, such as malware development, is increasingly critical. At the same time, it’s important that protective measures do not unduly hinder legitimate and beneficial applications that may involve techniques sometimes also used for malware development, such as low-level kernel engineering.

To balance safety and utility, Codex was trained to identify and precisely refuse requests aimed at development of malicious software, while clearly distinguishing and supporting legitimate tasks. We've also enhanced our policy frameworks and incorporated rigorous safety evaluations to reinforce these boundaries effectively. We’ve published an addendum to the o3 System Card to reflect these evaluations.

Secure execution

The Codex agent operates entirely within a secure, isolated container in the cloud. During task execution, internet access is disabled, limiting the agent’s interaction solely to the code explicitly provided via GitHub repositories and pre-installed dependencies configured by the user via a setup script. The agent cannot access external websites, APIs, or other services.

Early use cases

Technical teams at OpenAI have started using Codex as part of their daily toolkit. It is most often used by OpenAI engineers to offload repetitive, well-scoped tasks, like refactoring, renaming, and writing tests, that would otherwise break focus. It’s equally useful for scaffolding new features, wiring components, fixing bugs, and drafting documentation. Teams are building new habits around it: triaging on-call issues, planning tasks at the start of the day, and offloading background work to keep moving. By reducing context-switching and surfacing forgotten to-dos, Codex helps engineers ship faster and stay focused on what matters most.

 

Leading up to release, we've also been working with a small group of external testers to better understand how Codex performs across diverse codebases, development processes, and teams.

  • Cisco is exploring how Codex can help their engineering teams bring ambitious ideas to life faster. As early design partners, Cisco is helping shape the future of Codex by evaluating it for real-world use cases across their product portfolio and providing feedback to the OpenAI team.
  • Temporal uses Codex to accelerate feature development, debug issues, write and execute tests, and refactor large codebases. It also helps them stay focused by running complex tasks in the background—keeping engineers in flow while speeding up iteration.
  • Superhuman uses Codex to speed up small but repetitive tasks like improving test coverage and fixing integration failures. It also helps them ship faster by enabling product managers to contribute lightweight code changes without pulling in an engineer, except for code review.
  • Kodiak is using Codex to help write debugging tools, improve test coverage, and refactor code—accelerating development of the Kodiak Driver, their autonomous driving technology. Codex has also become a valuable reference tool, helping engineers understand unfamiliar parts of the stack by surfacing relevant context and past changes.

Based on learnings from early testers, we recommend assigning well-scoped tasks to multiple agents simultaneously, and experimenting with different types of tasks and prompts to explore the model’s capabilities effectively.

Updates to Codex CLI

Last month, we launched Codex CLI, a lightweight open-source coding agent that runs in your terminal. It brings the power of models like o3 and o4-mini into your local workflow, making it easy to pair with them to complete tasks faster. 

Today, we’re also releasing a smaller version of codex-1, a version of o4-mini designed specifically for use in Codex CLI. This new model supports faster workflows in the CLI and is optimized for low-latency code Q&A and editing, while retaining the same strengths in instruction following and style. It’s available now as the default model in Codex CLI and in the API as codex-mini-latest. The underlying snapshot will be regularly updated as we continue to improve the Codex-mini model.

We’re also making it much easier to connect your developer account to Codex CLI. Instead of manually generating and configuring an API token, you can now sign in with your ChatGPT account and select the API organization you want to use. We’ll automatically generate and configure the API key for you. Plus and Pro users who sign in to Codex CLI with ChatGPT can also begin redeeming $5 and $50 in free API credits, respectively, later today for the next 30 days.

Codex availability, pricing, and limitations

Starting today, we’re rolling out Codex to ChatGPT Pro, Enterprise, and Team users globally, with support for Plus and Edu coming soon. Users will have generous access at no additional cost for the coming weeks so they can explore what Codex can do, after which we’ll roll out rate-limited access and flexible pricing options that let you purchase additional usage on demand.

For developers building with codex-mini-latest, the model is available on the Responses API and priced at $1.50 per 1M input tokens and $6 per 1M output tokens, with a 75% prompt caching discount.
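As a minimal sketch of what calling the model looks like from the OpenAI Python SDK via the Responses API (the prompt here is purely illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask codex-mini-latest a quick code question through the Responses API.
response = client.responses.create(
    model="codex-mini-latest",
    input=(
        "Review this function and suggest a safer rewrite:\n\n"
        "def load_config(path):\n"
        "    return eval(open(path).read())\n"
    ),
)

print(response.output_text)
```

As a rough illustration of the pricing, a request with 20,000 uncached input tokens and 2,000 output tokens would cost about $0.03 for input plus $0.012 for output, roughly $0.04 in total, with previously cached input tokens billed at a 75% discount.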

Codex is still early in its development. As a research preview, it currently lacks features like image inputs for frontend work, and the ability to course-correct the agent while it's working. Additionally, delegating to a remote agent takes longer than interactive editing, which can take some getting used to. Over time, interacting with Codex agents will increasingly resemble asynchronous collaboration with colleagues. As model capabilities advance, we anticipate agents handling more complex tasks over extended periods.

What’s next

We imagine a future where developers drive the work they want to own and delegate the rest to agents—moving faster and being more productive with AI. To achieve that, we’re building a suite of Codex tools that support both real-time collaboration and asynchronous delegation. 

Pairing with AI tools like Codex CLI and others has quickly become an industry norm, helping developers move faster as they code. But we believe the asynchronous, multi-agent workflow introduced by Codex in ChatGPT will become the de facto way engineers produce high-quality code.

Ultimately, we see these two modes of interaction—real-time pairing and task delegation—converging. Developers will collaborate with AI agents across their IDEs and everyday tools to ask questions, get suggestions, and offload longer tasks, all in a unified workflow.

Looking ahead, we plan to introduce more interactive and flexible agent workflows. Developers will soon be able to provide guidance mid-task, collaborate on implementation strategies, and receive proactive progress updates. We also envision deeper integrations across the tools you already use: today Codex connects with GitHub, and soon you’ll be able to assign tasks from Codex CLI, ChatGPT Desktop, or even tools such as your issue tracker or CI system.

Software engineering is one of the first industries to experience significant AI-driven productivity gains, opening new possibilities for individuals and small teams. While we’re optimistic about these gains, we’re also collaborating with partners to better understand the implications of widespread agent adoption for developer workflows and for skill development across people, skill levels, and geographies.

This is just the beginning—and we’re excited to see what you build with Codex.


Appendix

System message

We are sharing the codex-1 system message to help developers understand the model’s default behavior and tailor Codex to work effectively in custom workflows. For example, the codex-1 system message encourages Codex to run all tests mentioned in the AGENTS.md file, but if you’re short on time, you can ask Codex to skip these tests.