
OpenAI WebSocket Mode for Responses API: How to Use It and Why It's a Game-Changer for AI Agents (2026)

OpenAI just launched WebSocket mode for the Responses API — a persistent connection that cuts latency by up to 40% in agentic workflows. Learn how it works, how to integrate it in voice agents, and the top use cases you need to know.


OpenAI has officially launched WebSocket mode for its Responses API (wss://api.openai.com/v1/responses) — a persistent, low-latency connection designed specifically for long-running agentic workflows. If you're building AI agents that loop through dozens of tool calls, this is the most impactful infrastructure update OpenAI has shipped in recent months.


What Is WebSocket Mode for the Responses API?

Unlike the traditional HTTP REST approach where every turn opens a brand-new connection, WebSocket mode lets your agent maintain a single persistent connection to /v1/responses across the entire workflow.

Each new turn sends only incremental inputs (new user messages or tool outputs) along with a previous_response_id reference — no need to resend full conversation history. This is made possible by a connection-local in-memory cache that the server keeps for your most recent response on that socket.

Key distinction: This is different from the existing OpenAI Realtime API (wss://api.openai.com/v1/realtime), which handles speech-to-speech audio. The new WebSocket mode is for the text/chat Responses API, aimed at orchestration, agentic coding, and tool-heavy pipelines.


Why This Matters: Performance Gains

The old pattern — HTTP polling with full context resent each turn — adds significant overhead in agents that call many tools. WebSocket mode directly fixes this.

| Workflow Type | HTTP REST Pattern | WebSocket Mode |
| --- | --- | --- |
| Single turn Q&A | Fine | Fine |
| 5–10 tool call loop | Moderate overhead | Faster |
| 20+ tool call chain | High overhead, slow | Up to ~40% faster |
| ZDR / store=false | Works | Fully compatible |
| Parallel runs | N/A | Multiple connections needed |

The in-memory cache is the key. Instead of re-hydrating context from disk on every turn, the server reuses connection-local state for continuation — making each round-trip meaningfully faster in long agent loops.


How to Connect: Step-by-Step

Step 1 — Open the Connection

Install the websocket-client library if using Python (pip install websocket-client), then connect with your API key:

from websocket import create_connection
import json, os

ws = create_connection(
    "wss://api.openai.com/v1/responses",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
    ],
)

Step 2 — Send Your First response.create

Fire the first turn with the full system prompt, tools, and the user's initial message:

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "input": [
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Find the fizz_buzz() function in my codebase."}]
        }
    ],
    "tools": [
        # your tool definitions here
    ]
}))
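
After each `response.create`, your agent should drain streamed events off the socket until the server signals completion. The event names below (`response.output_text.delta`, `response.completed`) are modeled on OpenAI's streaming conventions and are an assumption in this sketch; adapt them to the events your connection actually emits:

```python
import json

def collect_response(event_stream):
    """Accumulate streamed events into (response_id, full_text).

    Event names here are assumptions modeled on OpenAI's streaming
    conventions; adjust them to what your connection actually emits.
    """
    response_id, chunks = None, []
    for raw in event_stream:
        event = json.loads(raw)
        if event.get("type") == "response.output_text.delta":
            chunks.append(event.get("delta", ""))
        elif event.get("type") == "response.completed":
            response_id = event["response"]["id"]
            break
    return response_id, "".join(chunks)

# Usage against a live socket (untested sketch):
# response_id, text = collect_response(iter(ws.recv, None))
```

Capturing the `response_id` here is what lets the next turn chain via `previous_response_id`.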

Step 3 — Continue Turns Incrementally

For every follow-up turn, only send new inputs + chain from the previous response ID. Never resend full conversation history.

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "previous_response_id": "resp_abc123",
    "input": [
        {
            "type": "function_call_output",
            "call_id": "call_xyz",
            "output": "{ 'result': 'function found at line 42' }",
        },
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Now optimize it for performance."}],
        }
    ],
    "tools": []
}))
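
A small helper keeps incremental turns honest: it packages only the new items plus the chain pointer, and serializes tool results as proper JSON strings. `build_turn` is our own illustrative name, not an SDK function:

```python
import json

def build_turn(previous_response_id, tool_outputs=None, user_text=None,
               model="gpt-5.2", store=False):
    """Build an incremental response.create payload: only new items
    plus the chain pointer, never the full conversation history."""
    items = []
    for call_id, output in (tool_outputs or {}).items():
        items.append({
            "type": "function_call_output",
            "call_id": call_id,
            # Tool outputs must be strings; serialize as real JSON
            "output": json.dumps(output),
        })
    if user_text:
        items.append({
            "type": "message", "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        })
    return {
        "type": "response.create",
        "model": model,
        "store": store,
        "previous_response_id": previous_response_id,
        "input": items,
        "tools": [],
    }

# ws.send(json.dumps(build_turn("resp_abc123",
#     tool_outputs={"call_xyz": {"result": "ok"}})))
```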

Step 4 — Warm Up for Faster First-Turn Response

Pre-warm the connection with generate: false to load context into cache before the user speaks:

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "generate": False,
    "input": [
        {"type": "message", "role": "system",
         "content": [{"type": "input_text", "text": "You are a helpful booking assistant."}]}
    ],
    "tools": []
}))

Integrating With Voice Agents

The WebSocket Responses API is the orchestration brain of your voice agent pipeline. Here's the full architecture:

User speaks
    ↓
[STT — Whisper / Deepgram]
    ↓  (transcript text)
[Responses API WebSocket] ← persistent connection
    ↓  (text + tool calls)
[Tool Execution Layer]  (calendar, CRM, search, etc.)
    ↓  (tool result)
[Responses API WebSocket] ← incremental continuation
    ↓  (final text response)
[TTS — OpenAI TTS / ElevenLabs]
    ↓
User hears response
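
A thin routing function can glue the stages together: each output item from the model either goes to the tool execution layer or on to TTS. The event shapes used here are assumptions for illustration:

```python
def next_pipeline_step(item):
    """Route one model output item to the next stage of the voice
    pipeline. Pure routing logic; the item shapes are assumptions."""
    if item["type"] == "function_call":
        # Hand off to the tool execution layer (calendar, CRM, search...)
        return ("tool_execution", item["name"], item.get("arguments"))
    if item["type"] == "message":
        # Final assistant text: concatenate and send to TTS
        text = "".join(part["text"] for part in item["content"]
                       if part["type"] == "output_text")
        return ("tts", text)
    return ("ignore", None)
```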

Why not just use the Realtime API for everything? The Realtime API (/v1/realtime) is best for native speech-to-speech without intermediate text. But if you need custom tool execution logic, text processing middleware, or store=false ZDR compliance, the Responses API WebSocket + STT + TTS pattern gives you far more control.


Key Use Cases

1. Agentic Coding Assistants

An AI coding agent that runs read_file → analyze → edit → run_tests → fix → run_tests in a loop is exactly what this is built for. With 20+ tool call chains running up to 40% faster, Cursor-style coding agents benefit enormously.

2. Voice-Based Customer Support Bots

Phone bots (built with Twilio, Plivo, or Exotel) can now use the Responses API WebSocket as the brain — keeping one persistent connection open per call session, handling CRM lookups, booking confirmations, and escalation logic through tool calls, all over a single socket.

3. Real-Time Orchestration Pipelines

Multi-agent orchestration systems — where a supervisor agent delegates tasks to sub-agents — benefit from incremental input continuation. Each delegation round trip doesn't re-upload the full context.

4. Long-Running Research Agents

An agent that browses the web, reads documents, calls search APIs, and synthesizes answers can now run a full 30–50 step pipeline without latency overhead accumulating at every turn.

5. AI Tutors and Learning Bots

Educational platforms running multi-turn Socratic dialogue with code execution and adaptive questioning can maintain session state on one persistent connection per student, with clean ZDR compliance for student data privacy.


How It Improves Existing Agents

  • No repeated context uploads — only new items are sent per turn, not the full thread

  • Connection-local cache — the server reuses in-memory state instead of loading from disk on each turn

  • ZDR-compatible — works with store=false, so no conversation data is persisted to OpenAI servers

  • Warmup support — pre-load tools and instructions before the user's first message to eliminate cold-start latency

  • Sequential safety — runs are executed one at a time on a connection, preventing race conditions
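
Since the server executes runs one at a time per connection, it helps to enforce the same rule client-side rather than silently queueing. A minimal sketch (the wrapper and its method names are our own, not an SDK API):

```python
import json
import threading

class SerializedSocket:
    """Guard a WebSocket so only one run is in flight at a time,
    mirroring the server's one-run-per-connection execution model."""

    def __init__(self, ws):
        self._ws = ws
        self._lock = threading.Lock()
        self._busy = False

    def send_json(self, payload):
        # Refuse a second concurrent run on the same socket; true
        # parallelism requires opening another connection.
        with self._lock:
            if self._busy:
                raise RuntimeError(
                    "a run is already in flight; open another connection")
            self._busy = True
        self._ws.send(json.dumps(payload))

    def mark_done(self):
        # Call this when response.completed (or an error) arrives.
        with self._lock:
            self._busy = False
```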


Connection Limits and Error Handling

  • Max 60 minutes per WebSocket connection — implement a reconnect handler that resumes from the last response_id

  • No multiplexing — if you need parallel agent runs, open separate connections

  • previous_response_not_found — returned when the cached ID is missing; handle by sending full context again or using /responses/compact first

def reconnect_and_continue(last_response_id, full_context):
    ws = create_connection(
        "wss://api.openai.com/v1/responses",
        header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
    )
    # With store=True, response IDs persist server-side, so the chain can
    # resume from last_response_id. Under store=False the cached chain died
    # with the socket, so replay the full history instead.
    payload = {"type": "response.create", "model": "gpt-5.2",
               "store": True, "tools": []}
    if last_response_id:
        payload["previous_response_id"] = last_response_id
        payload["input"] = []  # nothing new yet; just re-anchor the chain
    else:
        payload["input"] = full_context
    ws.send(json.dumps(payload))
    return ws
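
To decide between the two recovery paths — resending full context or compacting first — a small dispatcher can inspect the error and the size of the local history. The error-event shape and the threshold below are illustrative assumptions:

```python
def recovery_action(error_event, history_items, compact_threshold=200):
    """Pick a recovery strategy for a failed chained turn.

    The error code comes from the article; the event dict shape and
    the item-count threshold are illustrative assumptions.
    """
    code = error_event.get("error", {}).get("code")
    if code != "previous_response_not_found":
        return "raise"  # some other failure; surface it
    if len(history_items) > compact_threshold:
        return "compact_then_resend"  # long run: shrink via /responses/compact
    return "resend_full_context"
```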

/responses/compact — Your Context Window Safety Net

For very long agent runs that approach context limits, use /responses/compact to compress history, then start a fresh chain:

# The compaction call goes over plain HTTP via the SDK client
# (e.g. `from openai import OpenAI; client = OpenAI()`):
compacted = client.responses.compact(model="gpt-5.2", input=long_input_array)

ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-5.2",
    "store": False,
    "previous_response_id": None,
    "input": [
        *compacted.output,
        {"type": "message", "role": "user",
         "content": [{"type": "input_text", "text": "Continue from here."}]}
    ],
    "tools": []
}))
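
Rather than waiting for a hard context-limit error, you can trigger compaction proactively with a rough token estimate. The window size, chars-per-token ratio, and headroom below are illustrative assumptions, not published limits:

```python
import json

def should_compact(history_items, context_window_tokens=400_000,
                   chars_per_token=4, headroom=0.8):
    """Rough compaction trigger: estimate tokens from serialized length
    and compare against a fraction of the context window. All three
    default numbers are illustrative assumptions."""
    approx_tokens = len(json.dumps(history_items)) / chars_per_token
    return approx_tokens > context_window_tokens * headroom
```

Check this before each turn in long runs; when it fires, call `/responses/compact` and start a fresh chain as shown above.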

Quick Reference: Which Transport to Use

| Scenario | Best Transport |
| --- | --- |
| Browser voice app (mic input) | WebRTC (/v1/realtime) |
| Server-to-server speech-to-speech | WebSocket Realtime API |
| Server agent with many tool calls | WebSocket Responses API (new) |
| Simple single-turn chat | HTTP REST /v1/responses |
| Long agentic coding / research runs | WebSocket Responses API (new) |

OpenAI's new WebSocket mode for the Responses API marks a clear architectural shift — from stateless HTTP calls to stateful, session-aware agent connections. For any developer building production AI agents in 2026, this is the right transport layer to adopt now.