
OpenAI API: Responses API and Structured Outputs Specification and Implementation Guide

Sloth255

Introduction

Entering 2025, the OpenAI API has reached a major turning point. In March 2025, the Responses API was officially released (GA), consolidating the conversation capabilities of the Chat Completions API and the tool integration features of the Assistants API into a single endpoint. The legacy Assistants API is scheduled for decommissioning on August 26, 2026.

Furthermore, Structured Outputs, which ensures model outputs strictly adhere to a JSON schema, demonstrates its true potential when combined with the Responses API. Since its release in August 2024, it has significantly improved the reliability of agentic workflows and data extraction pipelines.

This article first outlines the overall OpenAI API landscape as of 2025, then dives into the specifications and implementation details of these two key topics.

Source Information
Specifications and performance data in this article refer to the OpenAI Official Documentation (developers.openai.com/api), the Official Migration Guide (developers.openai.com/api/docs/guides/migrate-to-responses), and the Official Blog (openai.com/index/introducing-structured-outputs-in-the-api).


1. API Landscape Overview

The major categories of the OpenAI API as of 2025 are organized as follows:

Category | Endpoint | Positioning
Responses API | POST /v1/responses | Recommended for new projects. Unified interface for agents.
Chat Completions API | POST /v1/chat/completions | Continued support (no planned deprecation).
Realtime API | WebRTC / WebSocket / SIP | Real-time bidirectional voice and text.
Embeddings | POST /v1/embeddings | Vector search and RAG.
Images | POST /v1/images/generations | Image generation and editing.
Audio | POST /v1/audio/transcriptions | Speech recognition and TTS.

OpenAI has positioned the Responses API as the primary destination for new features, and the decommissioning of the Assistants API on August 26, 2026, has been officially confirmed (Source: Official Migration Guide). While the Chat Completions API will continue to be supported without a decommissioning date, the Responses API is the current recommendation for new projects.

Structured Outputs is not an independent endpoint like those in the table above, but rather an output format control option available for both the Responses API and the Chat Completions API. It is specified using the text.format parameter in the former and the response_format parameter in the latter. This article focuses particularly on its combination with the Responses API.


2. Responses API

2.1 Overview

The Responses API is a new primitive that succeeds the Chat Completions API and integrates the features of the Assistants API. It reached General Availability (GA) in March 2025.

The most significant change is the ability to persist conversation state on the server side. Unlike traditional Chat Completions, which required including the entire conversation history in every request, the Responses API allows you to continue a conversation simply by passing a previous_response_id.

2.2 Basic Request

import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Tell me the weather in Tokyo.",
});

console.log(response.output_text); // Retrieve text directly using the output_text helper

2.3 The output_text Helper

output_text is a convenience property provided by the OpenAI SDK, not a part of the API specification itself.

In the raw response from the Responses API, the text is nested within the following structure:

response.output[0].content[0].text

output_text abstracts this traversal away. Internally, the SDK concatenates the text of every content block with type: "output_text" across all items of type: "message" in the output array.

// Both result in the same output
console.log(response.output_text);
console.log(response.output[0].content[0].text);

However, if the output contains no text message (for example, when the response consists only of tool calls), output_text yields an empty string. For robust agentic code that uses tools, loop through the output array and check each item's type.

for (const item of response.output) {
  if (item.type === "message") {
    for (const block of item.content) {
      if (block.type === "output_text") {
        console.log(block.text);
      }
    }
  }
}

2.4 role in Input Messages

When passing an array to input, you can specify a role for each message. The role tells the model "who is making this statement," and there are three types.

role | Meaning | Typical Usage
system | Instructions from the system (developer) | Defines the model's behavior, tone, and constraints. Generally placed once at the start of a conversation.
user | Input from the end-user | Represents the user's statements or questions.
assistant | The model's own past statements | Used in multi-turn conversations to provide previous responses as history.

const response = await client.responses.create({
  model: "gpt-4o",
  input: [
    {
      role: "system",
      content: "You are a helpful Japanese assistant. Please answer concisely.",
    },
    {
      role: "user",
      content: "Tell me about JavaScript array methods.",
    },
  ],
});

If you pass a string directly to input (as in the sample in 2.2), that string is treated as a message with role: "user". Use the array format if you want fine-grained control over the model's behavior or if you want to provide system instructions.

2.5 Response Structure

Unlike Chat Completions' choices, results are returned in an output array.

{
  "id": "resp_68af4030...",
  "object": "response",
  "created_at": 1756315696,
  "model": "gpt-4o",
  "output": [
    {
      "id": "msg_68af4033...",
      "type": "message",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The weather in Tokyo is sunny."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 15,
    "output_tokens": 12,
    "total_tokens": 27
  }
}
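As a reference, the traversal that output_text performs can be reproduced with a small helper. This is a sketch (the function name and sample payload are ours, not part of the SDK):

```javascript
// Collect all output_text blocks from a Responses API payload.
// Mirrors what the SDK's output_text helper does for you.
function extractOutputText(response) {
  const parts = [];
  for (const item of response.output ?? []) {
    if (item.type !== "message") continue;
    for (const block of item.content ?? []) {
      if (block.type === "output_text") parts.push(block.text);
    }
  }
  return parts.join("");
}

// Works against the sample payload above:
const sample = {
  output: [
    {
      type: "message",
      role: "assistant",
      content: [{ type: "output_text", text: "The weather in Tokyo is sunny." }],
    },
  ],
};
console.log(extractOutputText(sample)); // "The weather in Tokyo is sunny."
```

Because it skips non-message items, the helper returns an empty string for tool-call-only outputs instead of throwing.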

2.6 Multi-turn Conversation

There are two ways to achieve multi-turn conversation with the Responses API.

1. Passing History as an Array in input

This is the traditional method from the Chat Completions API. You maintain and manage the conversation history on the client side and include all messages in every request.

const history = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me the weather in Tokyo." },
  { role: "assistant", content: "It is sunny in Tokyo today." },
];

const response = await client.responses.create({
  model: "gpt-4o",
  input: [...history, { role: "user", content: "How about tomorrow?" }],
});

Since state is not kept on the server, you can freely manipulate the history, such as deleting or summarizing specific messages. However, as the conversation grows longer, the number of input tokens increases, impacting both cost and latency.
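The bookkeeping for this approach is simple: after each turn, append the user message and the model's reply to the history before the next request. A minimal sketch (no API call; the helper name is ours):

```javascript
// Append one completed turn (user question + assistant answer) to a
// client-managed history array, returning the new history.
function appendTurn(history, userContent, assistantContent) {
  return [
    ...history,
    { role: "user", content: userContent },
    { role: "assistant", content: assistantContent },
  ];
}

let history = [{ role: "system", content: "You are a helpful assistant." }];
history = appendTurn(history, "Tell me the weather in Tokyo.", "It is sunny in Tokyo today.");
// history now holds 3 messages and can be passed as `input` on the next turn
console.log(history.length); // 3
```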

2. Passing previous_response_id (Responses API exclusive)

This is the server-side state management method newly introduced with the Responses API. By simply passing the previous response ID, OpenAI's servers will carry over the conversation history.

// First turn
const response1 = await client.responses.create({
  model: "gpt-4o",
  input: "Tell me the weather in Tokyo.",
});

// Second turn: Context is inherited using only previous_response_id
const response2 = await client.responses.create({
  model: "gpt-4o",
  input: "How about tomorrow?",
  previous_response_id: response1.id,
});

The client doesn't need to maintain or send the entire history, keeping the request size small. Furthermore, when using reasoning models like o1 or o3, reasoning tokens are preserved between turns, a significant advantage that improves accuracy for tasks requiring continuous reasoning.

To use this method, the response must be saved on the server. The store parameter defaults to true; if you set store: false, the response can no longer be referenced by its ID.

Selection Criteria

Situation | Recommended Method
Need to modify/edit history mid-way (RAG injection, deleting old messages, etc.) | Method 1 (array-based)
Solving complex multi-turn problems with reasoning models (o1, o3, etc.) | Method 2 (previous_response_id)
Simple chat where you want to avoid manual history management | Method 2 (previous_response_id)
Using the Chat Completions API | Method 1 (previous_response_id is Responses API only)

If persistent conversation management is required, integration with the Conversations API is also possible.

2.7 Built-in Tools

One of the greatest benefits of the Responses API is the set of built-in tools available without extra infrastructure.

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Search for the latest OpenAI news and summarize it.",
  tools: [
    { type: "web_search_preview" },
    { type: "file_search" },
    { type: "code_interpreter" },
  ],
});

Tool | Purpose
web_search_preview | Web search equivalent to ChatGPT's.
file_search | RAG search over uploaded files.
code_interpreter | Code execution and data analysis.
computer_use | Computer operation agent.
mcp | Connection to third-party MCP servers.

MCP (Model Context Protocol) Integration

Connectors are OpenAI-maintained MCP wrappers for popular services like Google Workspace or Dropbox, while Remote MCP servers are any server on the public internet that implements the remote MCP protocol (Source: OpenAI Connectors and MCP Guide).

The Responses API can integrate with remote MCP servers that support Streamable HTTP or HTTP/SSE transport protocols. When a tool is specified, the API first retrieves the list of available tools from the server (mcp_list_tools), and the model then calls the necessary tools from that list.

Basic Connection Example (Source: OpenAI Using tools Guide)

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Roll 2d6 and tell me the result.",
  tools: [
    {
      type: "mcp",
      server_label: "dice_server",           // Identifier for the server (arbitrary)
      server_url: "https://example.com/mcp", // URL of the MCP server
      require_approval: "never",             // Automatically approve tool calls
    },
  ],
});

console.log(response.output_text);

Approval Control with require_approval

By default, all tool calls require explicit approval from the developer. You can control this behavior with require_approval (Source: OpenAI Connectors and MCP Guide).

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Tell me about the MCP specification's transport protocols.",
  tools: [
    {
      type: "mcp",
      server_label: "deepwiki",
      server_url: "https://mcp.deepwiki.com/mcp",
      require_approval: {
        never: {
          // These two tools don't require approval; others do
          tool_names: ["ask_question", "read_wiki_structure"],
        },
      },
    },
  ],
});

require_approval Value | Behavior
"never" | Automatically approve all tool calls.
{ never: { tool_names: [...] } } | Automatically approve specified tools; others require approval.
Omitted (default) | All tool calls require approval.
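When a call does require approval, the response's output array contains mcp_approval_request items, and you approve by sending mcp_approval_response items back in a follow-up request (Source: Connectors and MCP Guide). A hedged sketch of that bookkeeping; the item shapes follow the guide, so verify them against the current docs:

```javascript
// Scan a response's output for pending MCP approval requests and build
// the approval items to send back in the next request's `input`.
// Only tools on an explicit allowlist are approved.
function buildApprovals(output, approvedTools) {
  return output
    .filter((item) => item.type === "mcp_approval_request")
    .filter((item) => approvedTools.includes(item.name))
    .map((item) => ({
      type: "mcp_approval_response",
      approval_request_id: item.id,
      approve: true,
    }));
}

const output = [
  { type: "mcp_approval_request", id: "mcpr_123", name: "ask_question", arguments: "{}" },
];
const approvals = buildApprovals(output, ["ask_question"]);
// approvals holds one mcp_approval_response approving mcpr_123
```

The resulting items would be sent as the next request's input together with previous_response_id so the server can resume the tool call.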

Connecting to Servers Requiring Authentication

If the MCP server requires authentication, pass the token using the headers parameter.

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Fetch some data.",
  tools: [
    {
      type: "mcp",
      server_label: "my_server",
      server_url: "https://my-mcp-server.example.com/mcp",
      require_approval: "never",
      headers: {
        Authorization: `Bearer ${process.env.MCP_ACCESS_TOKEN}`,
      },
    },
  ],
});

Tool List Caching: While retrieving the tool list from the MCP server (mcp_list_tools) occurs per request, in multi-turn conversations using previous_response_id, the tool list is included in the previous response, so re-retrieval is skipped (Source: OpenAI Cookbook: MCP Tool Guide).

2.8 Key Request Parameters

Parameter | Type | Description
model | string | Model name (e.g., gpt-4o).
input | string / array | Text or multimodal input.
previous_response_id | string | Previous response ID for multi-turn.
tools | array | Definitions of tools to use.
text.format | object | Specification for Structured Outputs (see below).
stream | boolean | Enable streaming.
store | boolean | Whether to save the response on the server (default: true).
reasoning | object | Reasoning configuration for o-series models, e.g. { effort: "low" / "medium" / "high" }.
background | boolean | Asynchronous execution in background mode.

2.9 Comparison with Chat Completions

Feature | Chat Completions | Responses API
Conversation State | Client-side (entire history required) | Server-side (previous_response_id)
Web Search | Manual implementation needed | Built-in (web_search_preview)
File Search / RAG | Manual implementation needed | Built-in (file_search)
Code Execution | Manual implementation needed | Built-in (code_interpreter)
MCP Connection | Not supported | Native support for remote MCP
Reasoning Token Persistence | Discarded between turns | Can be persisted
output_text Helper | No | Yes
Format Specification | response_format | text.format
New Feature Delivery | Limited | Primary destination

2.10 Reasoning Models (o-series)

Distinct from the GPT series, OpenAI offers a group of models called Reasoning Models. These models execute an internal step-by-step thinking process (Chain-of-Thought) before generating an answer. This internal thinking is counted as reasoning tokens, which are not included in the final output.

They demonstrate significantly higher accuracy than GPT-4o for tasks requiring multi-step reasoning, such as mathematics, coding, logical inference, and complex analysis. On the other hand, latency and costs are higher because thinking takes time.

Major models as of 2025 are:

Model | Characteristics
o1 / o1-mini | First-generation reasoning models.
o3 / o3-mini | High-precision, high-performance successor series.
o4-mini | Model balanced for cost and performance.

Reasoning depth can be adjusted with the reasoning parameter (see 2.8). Setting effort to low reduces latency and cost with lightweight reasoning, while high performs deep reasoning for maximum precision.

const response = await client.responses.create({
  model: "o3",
  input: "Find the general term for this sequence: 1, 1, 2, 3, 5, 8, 13, ...",
  reasoning: { effort: "high" },
});

Additionally, in multi-turn conversations using previous_response_id (see 2.6, method 2), reasoning tokens are persisted between turns. When digging deeper into the same problem over multiple turns, accuracy and efficiency improve because the model inherits the previous turn's reasoning instead of starting over from scratch.


3. Structured Outputs

3.1 Overview and Background

Reliably forcing LLM outputs into JSON format has been a key challenge for application integration. OpenAI has solved this incrementally:

JSON mode (legacy feature) ensures syntactically correct JSON, but does not guarantee adherence to a schema. There was a risk of missing required fields or additional unwanted fields.

Structured Outputs, released in August 2024, guarantees 100% adherence to a JSON schema specified by the developer (Source: OpenAI Official Blog).

Internal evaluations (evals) at OpenAI show that gpt-4o-2024-08-06 achieves 100% adherence to complex JSON schemas with Structured Outputs, a massive leap from gpt-4-0613, which scored below 40% (Source: OpenAI Official Blog).

3.2 How It Works

The OpenAI API achieves structured outputs by converting the specified JSON Schema into a context-free grammar (CFG). The grammar constrains which tokens can be sampled at each step, enforcing schema compliance. Because of this, the first request with a new schema incurs additional latency while the grammar is pre-processed; subsequent requests with the same schema do not.

Note: This first-request latency applies to fine-tuned models as well; once a schema has been processed, later requests with the same schema avoid the penalty (Source: Structured Outputs Guide).

3.3 Two Ways to Use

Structured Outputs is provided in two forms on the API.

The first is via Function calling (tools), enabled by setting strict: true within the function definition. This is available on all models from gpt-4-0613 onwards and is suitable for connecting model capabilities with applications (e.g., accessing a database query function).

The second is via the response_format / text.format parameter, where specifying a json_schema is suitable for the model to respond to the user in a structured format (e.g., displaying different parts separately in a math tutorial UI).

3.4 Implementation Example (Responses API)

In the Responses API, the parameter has moved from response_format to text.format (Source: Official Migration Guide).

About Schema description

It is strongly recommended to include a description for each field in your JSON Schema. The description acts as an instruction to the model, providing clues to help it correctly determine what should go in that field. General field names like explanation or output are particularly prone to being misunderstood by the model without a description.

When using Zod, use .describe("..."). When writing JSON Schema directly, specify it with the "description" key inside the property object.

Schema Definition using Zod (Recommended)

import OpenAI from "openai";
import { zodTextFormat } from "openai/helpers/zod";
import { z } from "zod";

const client = new OpenAI();

const Step = z.object({
  explanation: z.string().describe("Explanation of what is being done in this calculation step."),
  output: z.string().describe("The calculation result for this step (formula or numerical value)."),
});

const MathResponse = z.object({
  steps: z.array(Step).describe("A list of steps for the solution."),
  final_answer: z.string().describe("The final answer to the equation (e.g., x = -3.75)."),
});

const response = await client.responses.parse({
  model: "gpt-4o",
  input: [
    { role: "system", content: "You are a math tutor. Explain step-by-step." },
    { role: "user", content: "Solve 8x + 7 = -23" },
  ],
  // zodTextFormat is the Responses API helper; zodResponseFormat is for Chat Completions
  text: { format: zodTextFormat(MathResponse, "math_response") },
});

const result = response.output_parsed;
console.log(result.final_answer);
for (const step of result.steps) {
  console.log(step.explanation, "->", step.output);
}

Specifying JSON Schema Directly

const response = await client.responses.create({
  model: "gpt-4o",
  input: [
    { role: "system", content: "You are a math tutor." },
    { role: "user", content: "Solve 8x + 7 = -23" },
  ],
  text: {
    format: {
      type: "json_schema",
      name: "math_response",
      strict: true,
      schema: {
        type: "object",
        description: "A response showing the step-by-step solution of an equation.",
        properties: {
          steps: {
            type: "array",
            description: "A list of steps for the solution.",
            items: {
              type: "object",
              properties: {
                explanation: {
                  type: "string",
                  description: "Explanation of what is being done in this calculation step.",
                },
                output: {
                  type: "string",
                  description: "The calculation result for this step (formula or numerical value).",
                },
              },
              required: ["explanation", "output"],
              additionalProperties: false,
            },
          },
          final_answer: {
            type: "string",
            description: "The final answer to the equation (e.g., x = -3.75).",
          },
        },
        required: ["steps", "final_answer"],
        additionalProperties: false,
      },
    },
  },
});

Note: The correct field in the Responses API is text.format. The old response_format key is deprecated in the Responses API (Source: OpenAI Developer Community).
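If you use client.responses.create rather than .parse, the schema-compliant result arrives as text in output_text and still needs to be parsed on the client. A minimal sketch (the helper name is ours); the defensive check is redundant when strict mode succeeds, but it catches truncated responses:

```javascript
// Parse the model's JSON text and fail loudly if required keys are missing.
// Structured Outputs guarantees the shape, but defensive parsing still
// protects against incomplete or unexpected responses.
function parseMathResponse(text) {
  const data = JSON.parse(text);
  for (const key of ["steps", "final_answer"]) {
    if (!(key in data)) throw new Error(`missing required field: ${key}`);
  }
  return data;
}

const raw = '{"steps":[{"explanation":"Subtract 7","output":"8x = -30"}],"final_answer":"x = -3.75"}';
const parsed = parseMathResponse(raw);
console.log(parsed.final_answer); // "x = -3.75"
```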

3.5 Structured Outputs in Function Calling

To apply Structured Outputs to a tool call, add strict: true to the function definition.

const response = await client.responses.create({
  model: "gpt-4o",
  input: "Tell me the delivery date for order #12345",
  tools: [
    {
      type: "function",
      name: "get_delivery_date",
      description: "Get the scheduled delivery date for an order.",
      strict: true,
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string" },
        },
        required: ["order_id"],
        additionalProperties: false,
      },
    },
  ],
});

Restriction: When using Structured Outputs for Function Calling, parallel_tool_calls must be set to false.

3.6 What Structured Outputs Guarantees (and What It Doesn't)

Item | Status | Supplement
Correct JSON syntax | ✅ Guaranteed
Adherence to the specified schema | ✅ Guaranteed (with strict: true)
Presence of required fields | ✅ Guaranteed
Use of values specified in enum | ✅ Guaranteed
Factual correctness | ❌ Not guaranteed | Hallucinations can occur for inputs unrelated to the schema.
Safety policy exemption | ❌ Not guaranteed | The model may return a refusal for safety reasons.

Example of handling refusal:

const response = await client.responses.parse({
  model: "gpt-4o",
  input: [/* ... */],
  text: { format: zodTextFormat(MathResponse, "math_response") },
});

if (response.output[0].content[0].type === "refusal") {
  console.log("Model refused:", response.output[0].content[0].refusal);
} else {
  const result = response.output_parsed;
  // proceed with result.steps / result.final_answer
}
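This check can be wrapped into a small helper so call sites stay clean. A sketch assuming the response shapes shown above (the helper name is ours):

```javascript
// Return { ok: true, value } for a parsed result, or { ok: false, refusal }
// when the model declined for safety reasons.
function unwrapParsed(response) {
  const first = response.output?.[0]?.content?.[0];
  if (first?.type === "refusal") {
    return { ok: false, refusal: first.refusal };
  }
  return { ok: true, value: response.output_parsed };
}

const refused = {
  output: [{ type: "message", content: [{ type: "refusal", refusal: "I can't help with that." }] }],
};
console.log(unwrapParsed(refused).ok); // false
```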

3.7 Schema Constraints in strict mode

In strict: true mode, some JSON Schema features are restricted (Source: Structured Outputs Guide).

  • additionalProperties: false is required.
  • All properties must be included in the required array.
  • Direct use of anyOf at the root object is not allowed.
  • Restrictions apply to combinations like oneOf, anyOf, etc.

To prevent discrepancies between schemas and type definitions, OpenAI strongly recommends using the SDK's native Zod support.
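The first two constraints are mechanical enough to lint locally before sending a schema. A rough sketch (covers object and array nodes only; not an exhaustive validator):

```javascript
// Recursively verify that every object node in a JSON Schema sets
// additionalProperties: false and lists all of its properties as required.
function checkStrictSchema(schema, path = "root") {
  const problems = [];
  if (schema.type === "object") {
    if (schema.additionalProperties !== false) {
      problems.push(`${path}: additionalProperties must be false`);
    }
    const props = Object.keys(schema.properties ?? {});
    const required = schema.required ?? [];
    for (const p of props) {
      if (!required.includes(p)) problems.push(`${path}.${p}: missing from required`);
    }
    for (const p of props) {
      problems.push(...checkStrictSchema(schema.properties[p], `${path}.${p}`));
    }
  } else if (schema.type === "array" && schema.items) {
    problems.push(...checkStrictSchema(schema.items, `${path}[]`));
  }
  return problems;
}

const bad = { type: "object", properties: { a: { type: "string" } }, required: [] };
console.log(checkStrictSchema(bad));
// reports the missing additionalProperties and the unlisted required field
```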


4. Realtime API

The Realtime API, which reached General Availability (GA) in 2025, is a specialized API for low-latency bidirectional voice and text streaming via WebRTC, WebSocket, or SIP. It is designed for real-time interaction use cases such as voice agents connected directly from a browser or integration with telephony systems (PBX), clearly distinguishing its use from the Responses API. For more details, refer to the Official Realtime API Guide.


5. Streaming

The Responses API supports streaming in Server-Sent Events (SSE) format, allowing you to receive long responses incrementally.

const stream = await client.responses.stream({
  model: "gpt-4o",
  input: "Tell me in detail about the beginning of the universe.",
});

for await (const event of stream) {
  if (event.type === "response.output_text.delta" && event.delta) {
    process.stdout.write(event.delta);
  }
}

Structured Outputs can also be combined with streaming, in which case a complete, schema-compliant JSON is returned at the end.
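Conceptually, the client concatenates the delta events into the final text; with Structured Outputs, that concatenation only becomes valid JSON once the stream completes. A sketch over faked events (event shape follows the example above):

```javascript
// Accumulate response.output_text.delta events into the final text.
function accumulateDeltas(events) {
  let text = "";
  for (const event of events) {
    if (event.type === "response.output_text.delta" && event.delta) {
      text += event.delta;
    }
  }
  return text;
}

const events = [
  { type: "response.output_text.delta", delta: '{"answer":' },
  { type: "response.output_text.delta", delta: '"42"}' },
  { type: "response.completed" },
];
console.log(JSON.parse(accumulateDeltas(events)).answer); // "42"
```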


6. HTTP Response Headers

API responses include headers useful for debugging and rate monitoring.

Header | Content
x-request-id | Unique ID for the request (required for support inquiries).
x-ratelimit-limit-requests | Current RPM limit applied.
x-ratelimit-limit-tokens | Current TPM limit applied.
x-ratelimit-remaining-requests | Remaining number of requests.
x-ratelimit-remaining-tokens | Remaining number of tokens.
x-ratelimit-reset-requests | Time until RPM resets.
x-ratelimit-reset-tokens | Time until TPM resets.

If you want to specify a request ID from the client side, add the X-Client-Request-Id header.
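These headers can be read programmatically to throttle a client before it ever hits a 429. A sketch (the helper name is ours; any Headers-like object with a get method works):

```javascript
// Pull the numeric rate-limit values out of a Headers-like object so a
// client can slow down before hitting 429s.
function readRateLimits(headers) {
  const num = (name) => Number(headers.get(name));
  return {
    remainingRequests: num("x-ratelimit-remaining-requests"),
    remainingTokens: num("x-ratelimit-remaining-tokens"),
  };
}

const headers = new Map([
  ["x-ratelimit-remaining-requests", "59"],
  ["x-ratelimit-remaining-tokens", "149000"],
]);
console.log(readRateLimits(headers)); // { remainingRequests: 59, remainingTokens: 149000 }
```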


7. Rate Limits

Rate limits are applied per Organization and Project (not per user).

  • RPM (Requests Per Minute): Number of requests per minute.
  • TPM (Tokens Per Minute): Number of tokens per minute.

Usage tiers are automatically upgraded based on cumulative payments and usage history. If 429 Too Many Requests is returned, it is recommended to retry with exponential backoff (Source: Rate Limits Guide).

import retry from "async-retry"; // third-party exponential-backoff helper

async function callApiWithBackoff(params) {
  return retry(
    async (bail) => {
      try {
        return await client.responses.create(params);
      } catch (err) {
        // Retry only rate-limit errors; fail fast on everything else
        if (err?.status !== 429) {
          bail(err);
          return;
        }
        throw err;
      }
    },
    {
      retries: 6,
      minTimeout: 1000,  // 1s initial wait
      maxTimeout: 60000, // cap at 60s
      randomize: true,   // add jitter
    }
  );
}

8. Version Stability and Model Pinning

Aliases like gpt-4o are periodically updated to new snapshots, which may change the output even for the same prompt. In production, it is recommended to pin a snapshot name like gpt-4o-2024-08-06 and always perform evaluation (eval) when updating.


Summary

Point | Content
New Projects | Use the Responses API (POST /v1/responses).
Chat Completions | Continued support. No immediate need to migrate.
Assistants API | Decommissioned August 26, 2026. Migration to Responses API recommended.
Structured Outputs | 100% schema compliance with strict: true + additionalProperties: false.
Responses API Usage | Use text.format instead of response_format.
Real-time Voice | Refer to the Realtime API (WebRTC / WebSocket / SIP).
Rate Limits | Monitor headers and retry with exponential backoff.
Model Versioning | Pin snapshots and perform evals in production.