Introducing Codex


Today we’re launching a research preview of Codex: a cloud-based software engineering agent that can work on many tasks in parallel. Codex can perform tasks for you such as writing features, answering questions about your codebase, fixing bugs, and proposing pull requests for review; each task runs in its own cloud sandbox environment, preloaded with your repository.

Codex is powered by codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result. We’re starting to roll out Codex to ChatGPT Pro, Enterprise, and Team users today, with support for Plus and Edu coming soon.

How Codex works

Today you can access Codex through the sidebar in ChatGPT and assign it new coding tasks by typing a prompt and clicking “Code”. If you want to ask Codex a question about your codebase, click “Ask”. Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses, linters, and type checkers. Task completion typically takes between 1 and 30 minutes, depending on complexity, and you can monitor Codex’s progress in real time.

Once Codex completes a task, it commits its changes in its environment. Codex provides verifiable evidence of its actions through citations of terminal logs and test outputs, allowing you to trace each step taken during task completion. You can then review the results, request further revisions, open a GitHub pull request, or directly integrate the changes into your local environment. In the product, you can configure the Codex environment to match your real development environment as closely as possible.

Codex can be guided by AGENTS.md files placed within your repository. These are text files, akin to README.md, where you can inform Codex how to navigate your codebase, which commands to run for testing, and how best to adhere to your project's standard practices. Like human developers, Codex agents perform best when provided with configured dev environments, reliable testing setups, and clear documentation. 
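For illustration only, here is a minimal sketch of what an AGENTS.md might contain. The layout, commands, and section names below are hypothetical and will differ for every project; Codex reads the file as free-form guidance rather than a fixed schema.

```markdown
# AGENTS.md (hypothetical example)

## Project layout
- Application code lives in `src/`, tests in `tests/`.

## How to run checks
- Run the test suite with `pytest -q`.
- Lint with `ruff check src tests` and type-check with `mypy src`.

## Conventions
- Follow the existing code style; keep functions small and add docstrings.
- Update or add tests for any behavior you change.
```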

On coding evaluations and internal benchmarks, codex-1 shows strong performance even without AGENTS.md files or custom scaffolding.

 
[Chart: SWE-Bench Verified accuracy (pass@k) versus number of attempts (1, 2, 4, 8) for codex-1 and o3-high.]

[Chart: Accuracy (%) on OpenAI internal SWE tasks: o1-high 11%, o4-mini-high 67%, o3-high 70%, codex-1 75%.]

23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded. codex-1 was tested at a maximum context length of 192k tokens and medium ‘reasoning effort’, which is the setting available in the product today. For details on o3 evaluations, see here.

Our internal SWE task benchmark is a curated set of real-world internal SWE tasks at OpenAI.

Building safe and trustworthy agents

We're releasing Codex as a research preview, in line with our iterative deployment strategy. We prioritized security and transparency when designing Codex so users can verify its outputs, a safeguard that grows increasingly important as AI models handle more complex coding tasks independently and safety considerations evolve. Users can check Codex’s work through citations, terminal logs, and test results. When uncertain or faced with test failures, the Codex agent explicitly communicates these issues, enabling users to make informed decisions about how to proceed. It remains essential for users to manually review and validate all agent-generated code before integration and execution.

[Screenshot: Codex code review with a test-file overlay verifying quoted filenames, plus a summary and passing tests.]
[Screenshot: Codex code review with a terminal overlay showing a passing test for quoted filenames, alongside the summary and diff of the “Fix /diff error with special characters” change.]

Aligning to human preferences

A primary goal while training codex-1 was to align outputs closely with human coding preferences and standards. Compared to OpenAI o3, codex-1 consistently produces cleaner patches ready for immediate human review and integration into standard workflows.

Prompt:

Please fix the following issue in the astropy/astropy repository. Please resolve the issue in the problem below by editing and testing code files in your current code execution session. The repository is cloned in the /testbed folder. You must fully solve the problem for your answer to be considered correct.

Problem statement: Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels

Consider the following model:

```python
from astropy.modeling import models as m
from astropy.modeling.separable import separability_matrix

cm = m.Linear1D(10) & m.Linear1D(5)
```

Its separability matrix, as you might expect, is a diagonal:

```python
>>> separability_matrix(cm)
array([[ True, False],
       [False,  True]])
```

If I make the model more complex:

```python
>>> separability_matrix(m.Pix2Sky_TAN() & m.Linear1D(10) & m.Linear1D(5))
array([[ True,  True, False, False],
       [ True,  True, False, False],
       [False, False,  True, False],
       [False, False, False,  True]])
```

The output matrix is again as expected: the outputs and inputs to the linear models are separable and independent of each other.

If however, I nest these compound models:

```python
>>> separability_matrix(m.Pix2Sky_TAN() & cm)
array([[ True,  True, False, False],
       [ True,  True, False, False],
       [False, False,  True,  True],
       [False, False,  True,  True]])
```

Suddenly the inputs and outputs are no longer separable? This feels like a bug to me, but I might be missing something?
Codex
astropy/modeling/separable.py
```diff
@@ -242,7 +242,7 @@ def _cstack(left, right):
         cright = _coord_matrix(right, 'right', noutp)
     else:
         cright = np.zeros((noutp, right.shape[1]))
-        cright[-right.shape[0]:, -right.shape[1]:] = 1
+        cright[-right.shape[0]:, -right.shape[1]:] = right
 
     return np.hstack([cleft, cright])
```
astropy/modeling/tests/test_separable.py
```diff
@@ -148,3 +148,17 @@ def test_custom_model_separable():
 
     assert not model_c().separable
     assert np.all(separability_matrix(model_c()) == [True, True])
+
+
+def test_nested_compound_model_separability():
+    """Regression test for nested compound models in separability_matrix."""
+    lin_compound = models.Linear1D(10) & models.Linear1D(5)
+    nested = models.Pix2Sky_TAN() & lin_compound
+    expected = np.array([
+        [True, True, False, False],
+        [True, True, False, False],
+        [False, False, True, False],
+        [False, False, False, True],
+    ])
+    result = separability_matrix(nested)
+    assert_allclose(result, expected)
```
OpenAI o3
astropy/modeling/separable.py
```diff
@@ -241,8 +241,26 @@ def _cstack(left, right):
     if isinstance(right, Model):
         cright = _coord_matrix(right, 'right', noutp)
     else:
+        # Insert the separability information contained in the right-hand
+        # operand into the bottom-right corner of the combined matrix.
+        #
+        # Prior to this change the implementation *discarded* the actual
+        # separability matrix of the right operand and replaced it with an
+        # array of ones. This effectively marked every output that originated
+        # from the right-hand side as depending on *all* of its inputs, i.e.
+        # it treated the nested model as completely non-separable. When the
+        # right operand was itself a compound model this led to overly
+        # conservative (and therefore incorrect) separability matrices—for
+        # example when combining an already analysed compound model with the
+        # ``&`` operator, as reported in gh-issue #xxxx (see regression tests
+        # added alongside this fix).
+        #
+        # The correct behaviour is analogous to the handling of the *left*
+        # array operand a few lines above: we should embed the pre-computed
+        # separability matrix of the right operand into the appropriate slice
+        # of the result without altering it.
         cright = np.zeros((noutp, right.shape[1]))
-        cright[-right.shape[0]:, -right.shape[1]:] = 1
+        cright[-right.shape[0]:, -right.shape[1]:] = right
 
     return np.hstack([cleft, cright])
```
astropy/modeling/tests/test_separability_matrix.py
```diff
@@ -0,0 +1,52 @@
+"""Tests for the separability_matrix utility.
+
+This file adds a regression test for the bug discussed in
+astropy/astropy#14966 (see also the accompanying change in
+``astropy.modeling.separable._cstack``).
+
+The failure was triggered when computing the separability matrix for a
+compound model where the right-hand side of an ``&`` operation was *itself*
+another compound model. In that case the implementation treated the nested
+model as completely non-separable, resulting in a matrix that incorrectly
+reported dependencies between all of its outputs and inputs.
+
+The test below reproduces the example from the original bug report and makes
+sure that the corrected implementation returns the expected block-diagonal
+matrix.
+"""
+
+import numpy as np
+
+
+from astropy.modeling import models as m
+from astropy.modeling.separable import separability_matrix
+
+
+def test_nested_compound_model_separability():
+    """Ensure separability_matrix works for nested compound models."""
+
+    # A simple separable compound model consisting of two independent 1-D
+    # linear transformations.
+    cm = m.Linear1D(10) & m.Linear1D(5)
+
+    # Combine the separable compound model with a *non-separable* two-dimensional
+    # TAN projection. The projection goes on the left-hand side of the ``&``
+    # operator while the nested compound model ``cm`` is on the right.
+    nested = m.Pix2Sky_TAN() & cm
+
+    # The expected correlation matrix is block-diagonal: the two outputs of
+    # the TAN projection depend on *both* of its inputs, while the two outputs
+    # of the nested linear model each depend on *only one* of their inputs.
+    expected = np.array(
+        [
+            [True, True, False, False],
+            [True, True, False, False],
+            [False, False, True, False],
+            [False, False, False, True],
+        ]
+    )
+
+    calculated = separability_matrix(nested)
+
+    assert calculated.shape == expected.shape
+    assert np.array_equal(calculated, expected)
```

Preventing abuse

Safeguarding against malicious applications of AI-driven software engineering, such as malware development, is increasingly critical. At the same time, it’s important that protective measures do not unduly hinder legitimate and beneficial applications that may involve techniques sometimes also used for malware development, such as low-level kernel engineering.

To balance safety and utility, Codex was trained to identify and precisely refuse requests aimed at development of malicious software, while clearly distinguishing and supporting legitimate tasks. We've also enhanced our policy frameworks and incorporated rigorous safety evaluations to reinforce these boundaries effectively. We’ve published an addendum to the o3 System Card to reflect these evaluations.

Secure execution

The Codex agent operates entirely within a secure, isolated container in the cloud. During task execution, internet access is disabled, limiting the agent’s interaction solely to the code explicitly provided via GitHub repositories and pre-installed dependencies configured by the user via a setup script. The agent cannot access external websites, APIs, or other services.

Early use cases

Technical teams at OpenAI have started using Codex as part of their daily toolkit. It is most often used by OpenAI engineers to offload repetitive, well-scoped tasks, like refactoring, renaming, and writing tests, that would otherwise break focus. It’s equally useful for scaffolding new features, wiring components, fixing bugs, and drafting documentation. Teams are building new habits around it: triaging on-call issues, planning tasks at the start of the day, and offloading background work to keep moving. By reducing context-switching and surfacing forgotten to-dos, Codex helps engineers ship faster and stay focused on what matters most.

 

Leading up to release, we've also been working with a small group of external testers to better understand how Codex performs across diverse codebases, development processes, and teams.

  • Cisco is exploring how Codex can help their engineering teams bring ambitious ideas to life faster. As early design partners, Cisco is helping shape the future of Codex by evaluating it for real-world use cases across their product portfolio and providing feedback to the OpenAI team.
  • Temporal uses Codex to accelerate feature development, debug issues, write and execute tests, and refactor large codebases. It also helps them stay focused by running complex tasks in the background—keeping engineers in flow while speeding up iteration.
  • Superhuman uses Codex to speed up small but repetitive tasks like improving test coverage and fixing integration failures. It also helps them ship faster by enabling product managers to contribute lightweight code changes without pulling in an engineer, except for code review.
  • Kodiak is using Codex to help write debugging tools, improve test coverage, and refactor code—accelerating development of the Kodiak Driver, their autonomous driving technology. Codex has also become a valuable reference tool, helping engineers understand unfamiliar parts of the stack by surfacing relevant context and past changes.

Based on learnings from early testers, we recommend assigning well-scoped tasks to multiple agents simultaneously, and experimenting with different types of tasks and prompts to explore the model’s capabilities effectively.

Updates to Codex CLI

Last month, we launched Codex CLI, a lightweight open-source coding agent that runs in your terminal. It brings the power of models like o3 and o4-mini into your local workflow, making it easy to pair with them to complete tasks faster. 

Today, we’re also releasing a smaller version of codex-1, a version of o4-mini designed specifically for use in Codex CLI. This new model supports faster workflows in the CLI and is optimized for low-latency code Q&A and editing, while retaining the same strengths in instruction following and style. It’s available now as the default model in Codex CLI and in the API as codex-mini-latest. The underlying snapshot will be regularly updated as we continue to improve the Codex-mini model.

We’re also making it much easier to connect your developer account to Codex CLI. Instead of manually generating and configuring an API token, you can now sign in with your ChatGPT account and select the API organization you want to use. We’ll automatically generate and configure the API key for you. Plus and Pro users who sign in to Codex CLI with ChatGPT can also begin redeeming $5 and $50 in free API credits, respectively, later today for the next 30 days.

Codex availability, pricing, and limitations

Starting today, we’re rolling out Codex to ChatGPT Pro, Enterprise, and Team users globally, with support for Plus and Edu coming soon. Users will have generous access at no additional cost for the coming weeks so they can explore what Codex can do, after which we’ll roll out rate-limited access and flexible pricing options that let you purchase additional usage on demand.

For developers building with codex-mini-latest, the model is available on the Responses API and priced at $1.50 per 1M input tokens and $6 per 1M output tokens, with a 75% prompt caching discount.
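As a minimal sketch of what calling the model looks like from the OpenAI Python SDK via the Responses API (the prompt here is purely illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask codex-mini-latest a quick code question through the Responses API.
response = client.responses.create(
    model="codex-mini-latest",
    input=(
        "Review this function and suggest a safer rewrite:\n\n"
        "def load_config(path):\n"
        "    return eval(open(path).read())\n"
    ),
)

print(response.output_text)
```

As a rough illustration of the pricing, a request with 20,000 uncached input tokens and 2,000 output tokens would cost about $0.03 for input plus $0.012 for output, roughly $0.04 in total, with previously cached input tokens billed at a 75% discount.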

Codex is still early in its development. As a research preview, it currently lacks features like image inputs for frontend work, and the ability to course-correct the agent while it's working. Additionally, delegating to a remote agent takes longer than interactive editing, which can take some getting used to. Over time, interacting with Codex agents will increasingly resemble asynchronous collaboration with colleagues. As model capabilities advance, we anticipate agents handling more complex tasks over extended periods.

What’s next

We imagine a future where developers drive the work they want to own and delegate the rest to agents—moving faster and being more productive with AI. To achieve that, we’re building a suite of Codex tools that support both real-time collaboration and asynchronous delegation. 

Pairing with AI tools like Codex CLI and others has quickly become an industry norm, helping developers move faster as they code. But we believe the asynchronous, multi-agent workflow introduced by Codex in ChatGPT will become the de facto way engineers produce high-quality code.

Ultimately, we see these two modes of interaction—real-time pairing and task delegation—converging. Developers will collaborate with AI agents across their IDEs and everyday tools to ask questions, get suggestions, and offload longer tasks, all in a unified workflow.

Looking ahead, we plan to introduce more interactive and flexible agent workflows. Developers will soon be able to provide guidance mid-task, collaborate on implementation strategies, and receive proactive progress updates. We also envision deeper integrations across the tools you already use: today Codex connects with GitHub, and soon you’ll be able to assign tasks from Codex CLI, ChatGPT Desktop, or even tools such as your issue tracker or CI system.

Software engineering is one of the first industries to experience significant AI-driven productivity gains, opening new possibilities for individuals and small teams. While we’re optimistic about these gains, we’re also collaborating with partners to better understand the implications of widespread agent adoption for developer workflows and for skill development across people, skill levels, and geographies.

This is just the beginning—and we’re excited to see what you build with Codex.


Appendix

System message

We are sharing the codex-1 system message to help developers understand the model’s default behavior and tailor Codex to work effectively in custom workflows. For example, the codex-1 system message encourages Codex to run all tests mentioned in the AGENTS.md file, but if you’re short on time, you can ask Codex to skip these tests.